Social Data Analysis & Visualisation - DTU, Spring 2022
- Laurine Dargaud, Alba Reinders Sánchez and Alejandro Valverde -
The purpose of our project is to analyze the real impact of the measure implemented in the city centre of Madrid known as Madrid Central.
To sum up, Madrid Central was an environmental measure that entered into force on the 30th of November 2018 and prohibited all polluting vehicles from entering the "Centro" district of Madrid. Unfortunately, it was taken down for political and economic reasons on the 1st of July 2019.
In this study, we aim to evaluate the effectiveness of this vehicle regulation on pollution by analyzing traffic and air quality.

We use multiple datasets, all of them from the Open Data platform of the Madrid City Council.

The datasets we use are:
Air Quality Data
Calidad del aire. Datos horarios desde 2001 (Air quality. Hourly data since 2001):
This data has, in each file, hourly air quality measurements from multiple measurement stations. There are 24 stations all around the city of Madrid, and a different set of gases is measured at each one of them.
Calidad del aire. Estaciones de control (Air quality. Control stations):
This is the geodata of the air quality measurement stations. For plotting with Bokeh we had to convert the spatial coordinates to a different format.
Traffic Data
Tráfico. Histórico de datos del tráfico desde 2013 (Traffic. Historic traffic data since 2013):
This data contains, in separate files, the traffic measurements recorded every 15 minutes at more than 3000 points. This is the core dataset of our traffic analysis but, to reduce its size, we preprocess it to keep only the daily average measurement at each point.
The geodata of the traffic measurement points comes from a separate file on the same portal. For plotting with Bokeh we had to convert the spatial coordinates to a different format.
Districts Geodata
Distritos municipales de Madrid (Municipal districts of Madrid):
This dataset contains the geodata of the boundaries of each district of Madrid.
We chose these datasets because we wanted to find out whether Madrid Central achieved its objective of reducing air pollution in the area, and we also wanted to inspect how effective and how large the change in traffic in and out of Madrid Central was.
Additionally, we wanted to analyze the relation between air quality and traffic in the city of Madrid.
While investigating this topic, we saw a lot of articles about Madrid Central. Some of them said how good and beneficial it was; others said the exact opposite. Sadly, most of the news and articles were strongly influenced by political ideas, since the measure was imposed by one political party against the wishes of the opposing party. Therefore, we feel they were biased and did not rely on data and science.
We aim to fix this by trying to answer how effective the measure really was, based on data, using visualizations and data analysis. We also want to present this in an informative and interactive way.
# CHANGE THE FOLLOWING CONSTANTS IF NEEDED
DOWNLOAD_RAW_AIR_QUALITY_DATA = False # Download or not the air quality data from Spanish database
PREPROCESS_AIR_QUALITY_DATA = False # Preprocess or not the air quality data
GET_FINAL_AIR_QUALITY_DATA = True # Open the Google Drive download URL to get the final air quality dataset
DOWNLOAD_RAW_TRAFFIC_DATA = False # Download or not the traffic data from Spanish database
PREPROCESS_TRAFFIC_DATA = False # Preprocess or not the traffic data
GET_FINAL_TRAFFIC_DATA = True # Open the Google Drive download URL to get the final traffic dataset
SAVE_PLOTS = False # If true, save the bokeh plots as html. Otherwise, show them only
# IMPORT
import pandas as pd
import numpy as np
import seaborn as sns
import time
from datetime import datetime as dt, timedelta
import re
import utm
import json
from tqdm import tqdm
import requests
import zipfile
import os, io
import glob
import webbrowser
from IPython.display import display
from collections import defaultdict
import matplotlib.pyplot as plt
from matplotlib.path import Path
# folium
import folium
import folium.plugins
from folium.plugins import TimestampedGeoJson
# bokeh
import bokeh.palettes
from bokeh.plotting import figure, show, output_file, save, reset_output
from bokeh.io import output_notebook
from bokeh.models import HoverTool, Legend, ColumnDataSource, Span, ColorBar, DatetimeTickFormatter, CustomJS, Button, FactorRange
from bokeh.models.widgets import Panel, Tabs
from bokeh.tile_providers import get_provider, CARTODBPOSITRON
from bokeh.layouts import layout
from bokeh.transform import linear_cmap, factor_cmap
output_notebook()
np.random.seed(42)
# CONSTANTS
# FOR COLORS
def get_color_from_palette(color):
""" Getting colors for plotting """
return tuple([int(c * 255) for c in color])
def get_dark_color_from_palette(color):
""" Getting darker colors for plotting """
return tuple([int(c * 200) for c in color])
PALETTE = "colorblind"
AIR_COLORS = [ get_color_from_palette(c) for c in sns.color_palette(PALETTE, 6) ]
AIR_DARK_COLORS = [ get_dark_color_from_palette(c) for c in sns.color_palette(PALETTE, 6) ]
DISTRICT_COLORS = [ get_color_from_palette(c) for c in sns.color_palette(PALETTE, 21) ]
DISTRICT_DARK_COLORS = [ get_dark_color_from_palette(c) for c in sns.color_palette(PALETTE, 21) ]
MADRID_IN_OUT_COLORS = [ get_color_from_palette(c) for c in sns.color_palette(PALETTE, 2) ]
MADRID_IN_OUT_DARK_COLORS = [ get_dark_color_from_palette(c) for c in sns.color_palette(PALETTE, 2) ]
WEEK_COLORS = [ get_color_from_palette(c) for c in sns.color_palette(PALETTE, 8) ]
WEEK_DARK_COLORS = [ get_dark_color_from_palette(c) for c in sns.color_palette(PALETTE, 8) ]
# MADRID CENTRAL - KEY DATES
START_DATE = dt(2018, 11, 30, 0, 0, 0)
FINES_DATE = dt(2019, 3, 15, 0, 0, 0)
END_DATE = dt(2019, 7, 1, 0, 0, 0)
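# Bokeh Span annotations on a datetime axis take their location in milliseconds since the Unix epoch, hence the *1000 below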
START_MC = time.mktime(START_DATE.timetuple())*1000
FINES_MC = time.mktime(FINES_DATE.timetuple())*1000
END_MC = time.mktime(END_DATE.timetuple())*1000
The first thing that has to be done in any data-related task is to preprocess and clean the data. In our case this is very important, as our starting point is many different files containing a lot of information. We have to reduce the dataset to keep only the relevant parts.
Let's start by downloading all the air quality data directly from the Open Data Portal of Madrid.
def download_air_quality_data():
""" Download all air quality data from 2016 until 2020.
Each file has a unique year id, we provide them in ID_list.
The name convention is
'https://datos.madrid.es/egob/catalogo/201200-{id}-calidad-aire-horario.zip'
In each ZIP file, we only extract the csv file of each month
"""
ID_list = {
2016:28,
2017:10306313,
2018:10306314,
2019:42,
2020:10306316
}
DATA_PATH = "data"
for year, year_id in tqdm(ID_list.items(), desc="Downloading air quality data", unit="file"):
url = f"https://datos.madrid.es/egob/catalogo/201200-{year_id}-calidad-aire-horario.zip"
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
for zipFile in z.infolist():
if (zipFile.filename.split('.')[-1] == 'csv'):
month_letters = zipFile.filename[:3]
# If it has been downloaded already, skip it
file_path = f"{DATA_PATH}/air-quality-{month_letters}-{year}-raw.csv"
if not os.path.isfile(file_path):
# Rename file
zipFile.filename = file_path
# Extract file
z.extract(zipFile)
if DOWNLOAD_RAW_AIR_QUALITY_DATA:
download_air_quality_data()
Once we have downloaded the data, we need to preprocess it into a format suitable for the upcoming visualisations.
Each monthly CSV file contains one column per hour plus an associated validation column (see the Air Quality Data documentation for more details - in Spanish). Eventually we want a dataset with the following columns only: PROVINCIA, MUNICIPIO, ESTACION, MAGNITUD, PUNTO_MUESTREO, value and datetime.
For this purpose, we need to apply the following steps on each monthly file:
- drop the validation columns (we treat all recorded data as validated)
- melt the 24 hourly columns into a single value column
- replace H24 with H00, since hour 24 is midnight of the following day
- combine ANO, MES, DIA and the hour label into a single datetime column
- keep only the station code in PUNTO_MUESTREO and drop the now-redundant ANO, MES, DIA and HOUR columns
Then we can concatenate all monthly dataframes into one final dataset: air_quality_data.csv.
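To make these steps concrete, here is a minimal sketch on toy data (hypothetical values, simplified column set) of the wide-to-long reshaping; the real preprocessing below additionally handles the H24 special case and the PUNTO_MUESTREO code.
import pandas as pd
# toy raw data: one row per (station, pollutant, day) with hour columns H01, H02, ...
# and validation columns V01, V02, ...
toy = pd.DataFrame({
    'ESTACION': [4], 'MAGNITUD': [1], 'ANO': [2018], 'MES': [11], 'DIA': [30],
    'H01': [7.0], 'V01': ['V'], 'H02': [8.0], 'V02': ['V'],
})
# 1) drop the validation columns
toy = toy.drop(columns=[c for c in toy.columns if c.startswith('V')])
# 2) melt the hour columns into a single 'value' column
long = toy.melt(id_vars=['ESTACION', 'MAGNITUD', 'ANO', 'MES', 'DIA'],
                var_name='HOUR', value_name='value')
# 3) assemble a datetime column from ANO/MES/DIA and the hour label
long['datetime'] = pd.to_datetime(
    long.rename(columns={'ANO': 'year', 'MES': 'month', 'DIA': 'day'})[['year', 'month', 'day']]
) + pd.to_timedelta(long['HOUR'].str[1:].astype(int), unit='h')
print(long[['ESTACION', 'MAGNITUD', 'datetime', 'value']])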
WARNING: the following air quality data processing takes time (about 25 minutes)!
If you just want to get the final air_quality_data.csv file, set only GET_FINAL_AIR_QUALITY_DATA = True.
if PREPROCESS_AIR_QUALITY_DATA:
def format_time(aRow):
""" Generate a datetime format based on ANO, MES, DIA and HOUR """
hour = int(aRow.HOUR[1:])
if hour != 0:
# note: H00 here was originally H24, i.e. midnight belonging to the following day (handled by the second return)
return pd.to_datetime(dt(aRow.ANO, aRow.MES, aRow.DIA, hour))
return pd.to_datetime(dt(aRow.ANO, aRow.MES, aRow.DIA, hour)+timedelta(days=1))
def format_PUNTO_MUESTREO(aRow):
""" The format of PUNTO_MUESTRO data is: {ID}_{magnitud}_{technical ID}.
We are only interested in the ID of the air quality station"""
return int(aRow.PUNTO_MUESTREO.split('_')[0])
def dms2dd(s):
""" From DMS (ex: 3º 42' 25.64'' O) to Longitude, Latitude"""
degrees, minutes, seconds, direction = re.split('[° \' " ]+', s)
dd = float(degrees[:-1]) + float(minutes)/60 + float(seconds)/(60*60);
if direction in ('S','O'):
dd*= -1
return dd
def format_lonlat(aRow):
return pd.Series([dms2dd(aRow.longitude), dms2dd(aRow.latitude)])
def dd2UTM(lat, lon):
""" From Lat, Lon to UTM """
r_major = 6378137.000
x = r_major * np.radians(lon)
scale = x/lon
y = 180.0/np.pi * np.log(np.tan(np.pi/4.0 +
lat * (np.pi/180.0)/2.0)) * scale
return x, y
def format_utm(aRow):
x, y = dd2UTM(aRow.latitude, aRow.longitude)
return pd.Series([x,y])
def process_air_quality_csv(aCSVFileName):
# load as dataframe
df = pd.read_csv(aCSVFileName, sep=';')
columns = df.columns.values
# drop validation columns
# NOTE: WE CONSIDER ALL DATA AS VALIDATED
df = df.drop(columns[[s.startswith('V') for s in columns]], axis=1)
# melt hours into one column
df = pd.melt(df,
id_vars = ['PROVINCIA','MUNICIPIO','ESTACION','MAGNITUD','PUNTO_MUESTREO','ANO','MES','DIA'],
value_name = 'value', var_name='HOUR')
# replace H24 with H00
df = df.replace('H24','H00')
# generate the timestamp column
df['datetime'] = df.apply(format_time, axis=1)
# format PUNTO_MUESTREO to keep the code only
df['PUNTO_MUESTREO'] = df.apply(format_PUNTO_MUESTREO, axis=1)
# drop useless time columns
df = df.drop(['ANO','MES','DIA','HOUR'], axis=1)
return df
# process and export all csv
DATA_PATH = 'data/'
for filename in tqdm(glob.iglob(DATA_PATH + 'air-quality-*-raw.csv'), desc="Processing air quality data", unit="file"):
name = filename.split('\\')[1]
new_name = name[:-8] +'.csv'
df_processed = process_air_quality_csv(DATA_PATH+name)
df_processed.to_csv(DATA_PATH+new_name, index=False, encoding='utf-8')
os.remove(DATA_PATH+name)
# gather all processed files in one csv
all_dfs = []
for filename in tqdm(glob.iglob(DATA_PATH + 'air-quality-*.csv'), desc="Merging air quality data", unit="file"):
name = filename.split('\\')[1]
all_dfs.append(pd.read_csv(DATA_PATH+name))
os.remove(DATA_PATH+name)
df_air_quality = pd.concat(all_dfs, axis=0)
df_air_quality.to_csv('air_quality_data.csv', index=False)
if GET_FINAL_AIR_QUALITY_DATA:
print('DOWNLOAD air_quality_data.csv in ./data/')
webbrowser.open('https://drive.google.com/uc?export=download&id=1Xh6Ce9mBVx2A1NqG8SUe4yUX_voswct8', new=2)
DOWNLOAD air_quality_data.csv in ./data/
We can then join the air quality data with the air quality stations table and the gas information table:
# load air quality stations
df_stations = pd.read_csv('shared_data/air_quality/air_quality_stations.csv')
# load gas infos table
df_gas = pd.read_csv('shared_data/air_quality/air_quality_magnitud.csv', sep=';')
df_gas['unit_per_m3_original'] = df_gas['unit_per_m3']
# load air quality data
df_air = pd.read_csv('data/air_quality_data.csv')
# converting Date to datetime type
df_air["datetime"] = pd.to_datetime(df_air["datetime"])
# merge with air quality stations
df_air = pd.merge(df_air, df_stations, left_on = 'PUNTO_MUESTREO', right_on='punto_muestreo', how='left').drop('PUNTO_MUESTREO', axis=1)
# merge with air quality recorded gas
df_air = pd.merge(df_air, df_gas, left_on = 'MAGNITUD', right_on='magnitud_id', how='left').drop('MAGNITUD', axis=1)
We restrict the time period to before the start of the COVID lockdown to avoid biasing the analysis, and we remove negative concentration values, which are measurement errors. We obtain our final dataset, ready for analysis!
# restrict up to February 2020 included
df_air = df_air[(df_air['datetime'] < pd.to_datetime('2020-02-29 23:59:59'))]
# remove negative values = errors
df_air = df_air[(df_air.value > 0) ]
df_air.head()
| | PROVINCIA | MUNICIPIO | ESTACION | value | datetime | punto_muestreo | name | longitude | latitude | altitude | utm_x | utm_y | magnitud_id | formula | unit_per_m3 | european_standard | unit_per_m3_original |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28 | 79 | 4 | 7.0 | 2016-04-01 01:00:00 | 28079004 | Plaza de España | -3.712197 | 40.423883 | 637 | -413239.904502 | 4.927732e+06 | 1 | SO2 | µg | 125 | µg |
| 1 | 28 | 79 | 4 | 8.0 | 2016-04-02 01:00:00 | 28079004 | Plaza de España | -3.712197 | 40.423883 | 637 | -413239.904502 | 4.927732e+06 | 1 | SO2 | µg | 125 | µg |
| 2 | 28 | 79 | 4 | 10.0 | 2016-04-03 01:00:00 | 28079004 | Plaza de España | -3.712197 | 40.423883 | 637 | -413239.904502 | 4.927732e+06 | 1 | SO2 | µg | 125 | µg |
| 3 | 28 | 79 | 4 | 7.0 | 2016-04-04 01:00:00 | 28079004 | Plaza de España | -3.712197 | 40.423883 | 637 | -413239.904502 | 4.927732e+06 | 1 | SO2 | µg | 125 | µg |
| 4 | 28 | 79 | 4 | 8.0 | 2016-04-05 01:00:00 | 28079004 | Plaza de España | -3.712197 | 40.423883 | 637 | -413239.904502 | 4.927732e+06 | 1 | SO2 | µg | 125 | µg |
The data we want to work with is very large, so we need to download it from the source, since it is not possible to upload it to the version control system we use (GitHub).
def download_data_traffic():
""" Download all traffic data from January 2016 (ID=32) until February 2020 (ID=81)
Some files do not follow the same naming convention, and need repairing.
The naming convention that most files follow is '{num_month}-{num_year}.csv',
so all extracted files are renamed to follow it
"""
FIRST_MONTH_ID = 32
LAST_MONTH_ID = 81
DATA_PATH = "data/traffic2"
for month_id in tqdm(range(FIRST_MONTH_ID, LAST_MONTH_ID+1), desc="Downloading data", unit="file"):
# Get month number, from 1 to 12
current_month = ((month_id - FIRST_MONTH_ID) % 12) + 1
# Get year number, from 2016 to 2021
current_year = int((month_id - FIRST_MONTH_ID) / 12) + 2016
# If it has been downloaded already, skip it
file_path = f"{DATA_PATH}/{current_month:02d}-{current_year}.csv"
if not os.path.isfile(file_path):
url = f"https://datos.madrid.es/egob/catalogo/208627-{month_id}-transporte-ptomedida-historico.zip"
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
zipcsv = z.infolist()[-1]
# Rename file
zipcsv.filename = file_path
# Extract file
z.extract(zipcsv)
if DOWNLOAD_RAW_TRAFFIC_DATA:
download_data_traffic()
Before diving into the actual data, we need some context. Madrid is divided into districts; there are 21 of them, and the Madrid Central area coincides exactly with the Centro district (hence the name).
We have a dataset with the locations of the traffic measurement points. As expected, they are not evenly distributed. Our first task is to determine in which district each traffic measurement point is located; a small sketch of the approach we use is shown below.
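The district assignment itself is done further down with matplotlib's Path; here is a minimal sketch of the idea on toy coordinates (a hypothetical square "district").
import numpy as np
from matplotlib.path import Path

# hypothetical square "district" boundary and two test points
square = Path([(0, 0), (1, 0), (1, 1), (0, 1)])
pts = np.array([[0.5, 0.5],   # inside the polygon
                [2.0, 2.0]])  # outside the polygon
print(square.contains_points(pts))  # [ True False]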
traffic_points = pd.read_csv("shared_data/traffic_points/pmed_trafico_03052016.csv", sep=";")
traffic_points.head()
| | idelem | tipo_elem | cod_cent | nombre | st_x | st_y |
|---|---|---|---|---|---|---|
| 0 | 1044 | 494 | 03FT08PM01 | 03FT08PM01 | 438963.314635 | 4.474734e+06 |
| 1 | 3600 | 494 | PM30901 | PM30901 | 443729.047369 | 4.473268e+06 |
| 2 | 3705 | 494 | PM41451 | PM41451 | 439858.261097 | 4.471574e+06 |
| 3 | 6823 | 494 | PM41453 | PM41453 | 439188.095183 | 4.470895e+06 |
| 4 | 7033 | 495 | 01015 | Pº Castellana S-N - Pl. Colon-Hermosilla | 441569.555897 | 4.475502e+06 |
First we need to convert the coordinates into the projection used by Bokeh tile maps (stored in the utm_x and utm_y columns).
def utm_from_latlon(lat, lon):
""" From a given lat and lon, calculates the correct UTM coordinates to
plot using `bokeh`
"""
r_major = 6378137.000
x = r_major * np.radians(lon)
scale = x/lon
y = 180.0/np.pi * np.log(np.tan(np.pi/4.0 +
lat * (np.pi/180.0)/2.0)) * scale
return x, y
def get_lat_lon_utm(row):
""" From a row containing the columns 'st_x' and 'st_y' calculates both the lat and lon
and the correct UTM coordinates to plot using `bokeh`
"""
# 30 and 'T' is the zone of Madrid
lat, lon = utm.to_latlon(row["st_x"], row["st_y"], 30, "T")
x, y = utm_from_latlon(lat, lon)
return pd.Series([lat, lon, x, y])
traffic_points[["latitude", "longitude", "utm_x", "utm_y"]] = traffic_points.apply(get_lat_lon_utm, axis=1)
traffic_points = traffic_points.rename(columns = {'nombre':'name'})
traffic_points.head()
| | idelem | tipo_elem | cod_cent | name | st_x | st_y | latitude | longitude | utm_x | utm_y |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1044 | 494 | 03FT08PM01 | 03FT08PM01 | 438963.314635 | 4.474734e+06 | 40.421001 | -3.719488 | -414051.481782 | 4.927311e+06 |
| 1 | 3600 | 494 | PM30901 | PM30901 | 443729.047369 | 4.473268e+06 | 40.408129 | -3.663184 | -407783.811885 | 4.925429e+06 |
| 2 | 3705 | 494 | PM41451 | PM41451 | 439858.261097 | 4.471574e+06 | 40.392598 | -3.708640 | -412843.963659 | 4.923159e+06 |
| 3 | 6823 | 494 | PM41453 | PM41453 | 439188.095183 | 4.470895e+06 | 40.386431 | -3.716471 | -413715.710072 | 4.922257e+06 |
| 4 | 7033 | 495 | 01015 | Pº Castellana S-N - Pl. Colon-Hermosilla | 441569.555897 | 4.475502e+06 | 40.428107 | -3.688839 | -410639.639249 | 4.928350e+06 |
Then we load the districts information to display them on the map.
with open("shared_data/districts/districts.geojson", "r", encoding='utf-8') as geojson:
geodata = json.load(geojson)
df_districts = pd.DataFrame([], columns=["name", "latitude",
"longitude", "utm_x",
"utm_y"])
for district in geodata["features"]:
# Get district name
district_name = district["properties"]["NOMBRE"]
# Get district coordinates
district_coord = district["geometry"]["coordinates"][0]
df_district = pd.DataFrame(district["geometry"]["coordinates"][0], columns=["st_x", "st_y"])
df_district["name"] = district_name
# Calculate correct utm
df_district[["latitude", "longitude", "utm_x", "utm_y"]] = df_district.apply(get_lat_lon_utm, axis=1)
df_district = df_district.drop(columns=["st_x", "st_y"])
# Append to all districts dataframe
df_districts = pd.concat([df_districts, df_district]).reset_index(drop=True)
district_name = df_districts["name"].unique()
df_districts
| | name | latitude | longitude | utm_x | utm_y |
|---|---|---|---|---|---|
| 0 | Centro | 40.407345 | -3.693162 | -411120.867401 | 4.925314e+06 |
| 1 | Centro | 40.407196 | -3.693202 | -411125.341089 | 4.925293e+06 |
| 2 | Centro | 40.406986 | -3.693227 | -411128.197118 | 4.925262e+06 |
| 3 | Centro | 40.407127 | -3.693677 | -411178.251532 | 4.925282e+06 |
| 4 | Centro | 40.407256 | -3.693849 | -411197.420454 | 4.925301e+06 |
| ... | ... | ... | ... | ... | ... |
| 9466 | Moncloa - Aravaca | 40.469899 | -3.802628 | -423306.659924 | 4.934464e+06 |
| 9467 | Moncloa - Aravaca | 40.469823 | -3.802359 | -423276.645066 | 4.934452e+06 |
| 9468 | Moncloa - Aravaca | 40.469748 | -3.802093 | -423247.039940 | 4.934441e+06 |
| 9469 | Moncloa - Aravaca | 40.469672 | -3.801822 | -423216.881019 | 4.934430e+06 |
| 9470 | Moncloa - Aravaca | 40.469448 | -3.801040 | -423129.799177 | 4.934398e+06 |
9471 rows × 5 columns
Now we can record the district in which each traffic point is located.
traffic_points["district"] = "None"
points = traffic_points[["utm_x", "utm_y"]]
for name in district_name:
path = Path(df_districts[df_districts["name"] == name][["utm_x", "utm_y"]])
points_in_path_mask = path.contains_points(points)
traffic_points.loc[points_in_path_mask, "district"] = name
# Discard the traffic points outside any district of Madrid, as they are outside the city
traffic_points = traffic_points.drop(traffic_points[traffic_points["district"] == "None"].index)\
.reset_index(drop=True)
traffic_points.head()
| | idelem | tipo_elem | cod_cent | name | st_x | st_y | latitude | longitude | utm_x | utm_y | district |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1044 | 494 | 03FT08PM01 | 03FT08PM01 | 438963.314635 | 4.474734e+06 | 40.421001 | -3.719488 | -414051.481782 | 4.927311e+06 | Moncloa - Aravaca |
| 1 | 3600 | 494 | PM30901 | PM30901 | 443729.047369 | 4.473268e+06 | 40.408129 | -3.663184 | -407783.811885 | 4.925429e+06 | Moratalaz |
| 2 | 3705 | 494 | PM41451 | PM41451 | 439858.261097 | 4.471574e+06 | 40.392598 | -3.708640 | -412843.963659 | 4.923159e+06 | Carabanchel |
| 3 | 6823 | 494 | PM41453 | PM41453 | 439188.095183 | 4.470895e+06 | 40.386431 | -3.716471 | -413715.710072 | 4.922257e+06 | Carabanchel |
| 4 | 7033 | 495 | 01015 | Pº Castellana S-N - Pl. Colon-Hermosilla | 441569.555897 | 4.475502e+06 | 40.428107 | -3.688839 | -410639.639249 | 4.928350e+06 | Salamanca |
The next step is to load the traffic datasets. These datasets have a lot of rows, as each of the more than 3000 measurement points records multiple parameters every 15 minutes, so a rough approximation of the number of rows in one monthly file is:

$$ 30(days) \cdot 24(hours) \cdot 4(measures\_per\_hour) \cdot 3000(traffic\_points) = 8640000 $$

And, taking into account that we use data from January 2016 until February 2020 (roughly 4 years), a more accurate total row count would be:

$$ 4(years) \cdot 365(days) \cdot 24(hours) \cdot 4(measures\_per\_hour) \cdot 3000(traffic\_points) = 420480000 $$

This amount of data (more than 420 million rows) is too much to handle efficiently and extract relevant information from. To reduce the number of rows, we decide to keep the average traffic intensity (number of cars) per day in each district. That way, we will have:

$$ 4(years) \cdot 365(days) \cdot 21(number\_districts) = 30660 $$

which is a much more manageable number, from which we aspire to detect the relevant information in the data: around 13714 times less data.
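These back-of-the-envelope figures can be reproduced directly (a quick sanity check, not part of the preprocessing pipeline):
# rough row-count estimates used above (orders of magnitude, not exact counts)
rows_per_month = 30 * 24 * 4 * 3000        # one month of raw 15-minute records
rows_raw_total = 4 * 365 * 24 * 4 * 3000   # ~4 years of raw records
rows_reduced   = 4 * 365 * 21              # one value per district per day
print(rows_per_month, rows_raw_total, rows_reduced)   # 8640000 420480000 30660
print(round(rows_raw_total / rows_reduced))           # ~13714-fold reduction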
def process_traffic_data(filepath, traffic_points_df):
""" Function to process each traffic data file. This preoprocess has as objective to reduce
the dimensionality of the data, only keeping one value per district per day, reducing this
way the number of rows to handle.
Arguments:
filepath -> path to load the csv
traffic_points_df -> traffic_points dataset (where they are located)
"""
# Load file
traffic_df = pd.read_csv(filepath, sep=";")
# One file is separated by ',' instead of ';', so if parsing with ';' yields
# a single column we re-read the file with ','
if len(traffic_df.columns) == 1:
traffic_df = pd.read_csv(filepath, sep=",")
# If the 'idelem' column does not exist, it is named 'id' instead, so rename it
if "idelem" not in traffic_df.columns:
traffic_df = traffic_df.rename(columns = {'id':'idelem'})
# Use only the traffic points for whom we have information
traffic_df = traffic_df[traffic_df["idelem"].isin(traffic_points_df["idelem"])]
# Transform date to datime type
traffic_df["fecha"] = pd.to_datetime(traffic_df["fecha"])
# Get date in separate columns
traffic_df["day"] = traffic_df["fecha"].dt.day
traffic_df["month"] = traffic_df["fecha"].dt.month
traffic_df["year"] = traffic_df["fecha"].dt.year
# Group by id and date, up to day, and get the average intensity per traffic point
traffic_df = traffic_df.groupby(["idelem",
"day",
"month",
"year"]).agg(mean_intensity=("intensidad", "mean")).reset_index()
# Merge with the traffic points to get the district for each point
traffic_df = traffic_df.merge(traffic_points_df[["idelem", "district"]], on="idelem")
# Save also the traffic points as separated files
traffic_points_intensity_df = traffic_df.copy()
# Group by again, to get only one value per district per day
traffic_df = traffic_df.groupby(["district", "day", "month", "year"]).mean()["mean_intensity"].reset_index()
# Get the date and day of the week for plotting purpose
traffic_df["date"] = pd.to_datetime(traffic_df[["day", "month", "year"]])
traffic_df["day_of_week"] = traffic_df["date"].dt.day_name()
traffic_points_intensity_df["date"] = pd.to_datetime(traffic_points_intensity_df[["day", "month", "year"]])
traffic_points_intensity_df["day_of_week"] = traffic_points_intensity_df["date"].dt.day_name()
# Save idelem as integer
traffic_points_intensity_df["idelem"] = traffic_points_intensity_df["idelem"].astype(int)
return traffic_df, traffic_points_intensity_df
def load_all_trafic_data(traffic_points_df):
""" Function to load all trafic data from the data folder,
after being processed
Arguments:
traffic_points_df -> traffic_points dataset (where they are located)
"""
DATA_PATH = "data/traffic"
traffic_data = pd.DataFrame([], columns=["district", "date", "day_of_week",
"day", "month", "year", "mean_intensity"])
traffic_points_data = pd.DataFrame([], columns=["district", "date", "day_of_week",
"day", "month", "year", "mean_intensity"])
for filepath in tqdm(os.listdir(DATA_PATH), desc="Processing files", unit="file"):
traffic_df, traffic_points_intensity_df= process_traffic_data(os.path.join(DATA_PATH, filepath), traffic_points_df)
traffic_data = pd.concat([traffic_data, traffic_df])
traffic_points_data = pd.concat([traffic_points_data, traffic_points_intensity_df])
return (traffic_data.sort_values(by=["district", "date"]).reset_index(drop="True"),
traffic_points_data.sort_values(by=["district", "date", "idelem"]).reset_index(drop="True"))
# specify path
df_traffic_path = "shared_data/traffic_intensity.csv"
df_traffic_intensity_path = "data/traffic_points_intensity.csv"
if GET_FINAL_TRAFFIC_DATA:
print('DOWNLOAD traffic_points_intensity.csv in ./data/')
webbrowser.open('https://drive.google.com/uc?export=download&id=1xYOBUeYF7G-0IaHt8wyuITfYp_QkcchN', new=2)
DOWNLOAD traffic_points_intensity.csv in ./data/
if not PREPROCESS_TRAFFIC_DATA:
total_traffic_df = pd.read_csv(df_traffic_path)
total_traffic_df["date"] = pd.to_datetime(total_traffic_df["date"])
traffic_points_intensity_df = pd.read_csv(df_traffic_intensity_path)
traffic_points_intensity_df["date"] = pd.to_datetime(traffic_points_intensity_df["date"])
traffic_points_intensity_df["idelem"] = traffic_points_intensity_df["idelem"].astype(int)
else:
total_traffic_df, traffic_points_intensity_df = load_all_trafic_data(traffic_points)
total_traffic_df.to_csv(df_traffic_path, index=False)
traffic_points_intensity_df.to_csv(df_traffic_intensity_path, index=False)
p = figure(title="Districts of Madrid", x_axis_type="mercator", y_axis_type="mercator",
height=600, width=600, tools="")
p.toolbar.logo = None
p.toolbar_location = None
for name, color in zip(district_name, DISTRICT_COLORS):
# Districts
source_dict = dict(utm_x = [[x for x in df_districts[df_districts["name"] == name]["utm_x"]]],
utm_y = [[y for y in df_districts[df_districts["name"] == name]["utm_y"]]],
name = [name])
source = ColumnDataSource(source_dict)
p.patches(xs="utm_x", ys="utm_y", color=color, line_width=3, alpha=0.4,
source=source, muted=False, muted_alpha=0.1)
cartodb = get_provider(CARTODBPOSITRON)
p.add_tile(cartodb)
TOOLTIPS = [
("", "@name"),
]
p.add_tools(HoverTool(tooltips=TOOLTIPS))
p.background_fill_color = None
p.border_fill_color = None
p = layout(p, sizing_mode='scale_both')
if SAVE_PLOTS:
output_file("html_plots/districts.html", title="Districts Map")
save(p)
reset_output()
output_notebook()
else:
show(p)
Map of Air Quality Stations
First let's get a better idea of the air quality stations' locations and the different gases they monitor. Let's plot a map to do so!
print(f'There are {len(df_stations.name.unique())} air quality stations in Madrid:\n')
print(', '.join(df_stations.name.unique()))
There are 24 air quality stations in Madrid: Plaza del Carmen, Plaza de España, Barrio del Pilar, Escuelas Aguirre, Cuatro Caminos, Ramón y Cajal, Vallecas, Arturo Soria, Villaverde, Farolillo, Moratalaz, Casa de Campo, Barajas Pueblo, Méndez Álvaro, Castellana, Retiro, Plaza Castilla, Ensanche de Vallecas, Urbanización Embajada, Plaza Elíptica, Sanchinarro, El Pardo, Juan Carlos I, Tres Olivos
# load MC area
cm_points = pd.read_csv('shared_data/districts/central_madrid_points.csv')
# add is_in_MC column for each air quality station
points = df_stations[["utm_x", "utm_y"]].values
path = Path(cm_points[["utm_x", "utm_y"]].values)
points_in_path_mask = path.contains_points(points)
df_stations["is_in_MC"] = False
df_stations.loc[points_in_path_mask, "is_in_MC"] = True
# add the list of tracked gas
def add_monitored_gas_list(aRow):
gas_list = df_air[df_air['name'] == aRow['name']]['formula'].unique()
return ', '.join(gas_list)
df_stations['gas_list'] = df_stations.apply(add_monitored_gas_list, axis=1)
# plot map
p = figure(title="Air quality stations in Madrid", x_axis_type="mercator", y_axis_type="mercator")
p.background_fill_color = None
source_in, source_out = ColumnDataSource(df_stations[df_stations.is_in_MC==True]), ColumnDataSource(df_stations[df_stations.is_in_MC==False])
cr_in = p.circle(x="utm_x", y="utm_y", size=20, color=MADRID_IN_OUT_COLORS[0], source=source_in)
cr_out = p.triangle(x="utm_x", y="utm_y", size=20, color=MADRID_IN_OUT_COLORS[1], source=source_out)
# Madrid Central lines
source_mc = ColumnDataSource(cm_points)
line_cm = p.line(x="utm_x", y="utm_y", color=MADRID_IN_OUT_COLORS[0], line_width=2,
source=source_mc, muted=False, muted_alpha=0.3)
cartodb = get_provider(CARTODBPOSITRON)
p.add_tile(cartodb)
p.add_tools(HoverTool(tooltips=[('Name', '@name'),('Monitored Pollutants','@gas_list')], renderers=[cr_in, cr_out]))
# add interactive legend
legend = Legend(items=[('Madrid Central Area', [line_cm]), ('IN Madrid Central area', [cr_in]), ('OUT of Madrid Central area', [cr_out])], location='center')
legend.click_policy="hide"
legend.location = "top_left"
p.add_layout(legend)
p = layout(p, sizing_mode='scale_both')
if SAVE_PLOTS:
output_file("html_plots/air_quality_stations.html")
save(p)
reset_output()
output_notebook()
else:
show(p)
We notice that only one air quality station lies inside the Madrid Central area: Plaza del Carmen, which monitors SO2, CO, NO, NO2, NOx and O3.
Boxplots of gas concentrations
We can plot boxplots of the gas concentrations to see how the data is distributed.
print(f'{len(df_gas.formula.unique())} different gas concentrations are monitored through the air quality stations of Madrid:')
gas_list_to_print = []
for i, aRow in df_gas.iterrows():
gas_name, gas_unit = aRow.formula, aRow.unit_per_m3_original
gas_list_to_print.append(f'{gas_name} ({gas_unit}/m3)')
print(', '.join(gas_list_to_print))
17 different gas concentrations are monitored through the air quality stations of Madrid: SO2 (µg/m3), CO (mg/m3), NO (µg/m3), NO2 (µg/m3), PM2.5 (µg/m3), PM10 (µg/m3), NOx (µg/m3), O3 (µg/m3), TOL (µg/m3), BEN (µg/m3), EBE (µg/m3), MXY (µg/m3), PXY (µg/m3), OXY (µg/m3), TCH (mg/m3), CH4 (mg/m3), NMHC (mg/m3)
def get_air_quality_boxplots(figsize=(10,10)):
all_gases = df_air.formula.unique()
plt.figure(figsize=figsize)
plt.boxplot([df_air[(df_air.formula==(aGas))]['value'].values for aGas in all_gases])
plt.xticks(range(1, len(all_gases)+1), all_gases)
plt.title('Boxplot of all gas concentrations')
plt.xlabel('Gas formula')
plt.ylabel('Concentration in µg or mg / m3')
plt.show()
get_air_quality_boxplots()
We note many outliers, including obvious huge ones for SO2 and NO. These are most likely recording errors from the air quality sensors, so we can remove them from the data.
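Before applying the cut, a quick way to inspect how extreme these values are (a small check on the dataframe defined above, not part of the original pipeline):
# how many hourly readings sit at or above the threshold used below
print((df_air['value'] >= 1000).sum(), 'readings >= 1000')
# upper tail per pollutant: the 99.9th percentile of each recorded concentration
print(df_air.groupby('formula')['value'].quantile(0.999).sort_values(ascending=False).head())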
df_air = df_air[df_air['value']< 1000]
We can plot the boxplots again.
get_air_quality_boxplots()
The most widespread pollutants are O3, NOx, NO2 and PM10.
Bar chart of average gas concentration per air quality station
Let's look at the average concentration of each pollutant per station.
all_stations = df_stations['name'].unique()
all_pollutants = list(df_gas['formula'].unique())
pollutant_colors = [get_color_from_palette(c) for c in sns.color_palette(PALETTE, len(all_pollutants))]
data = {'station' : all_stations}
for aGas in all_pollutants:
data[aGas]=[]
# get the average of each concentration per station
for aStation in df_air.name.unique():
df_air_gas_filtered = df_air[df_air.name == aStation]
avg_values_dict = dict(df_air_gas_filtered.groupby('formula').mean()['value'])
for aGas in all_pollutants:
if aGas in avg_values_dict.keys():
data[aGas].append(avg_values_dict[aGas])
else:
data[aGas].append(0)
p = figure(x_range=all_stations, height=500, title="Average concentration of gas per hour per station",
toolbar_location=None, tools="hover", tooltips="@station $name : @$name")
p.vbar_stack(all_pollutants, x='station', width=0.9, source=data, color=pollutant_colors,
legend_label=all_pollutants)
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
p = layout(p, sizing_mode='stretch_both')
if SAVE_PLOTS:
output_file("html_plots/air_quality_barchart_average_concentrations.html")
save(p)
reset_output()
output_notebook()
else:
show(p)
Multi-line chart of gas concentrations over time for given stations
def get_month_year(aRow):
return aRow.name.month_name() + ' ' + str(aRow.name.year)
def get_stats_dataframe(station):
df1 = df_air[(df_air.name == station)][['formula','value','datetime']]
df1 = df1.pivot(index='datetime', columns='formula', values='value').reset_index()
df1['datetime'] = df1.datetime.dt.floor('D')
df1['datetime'] = df1['datetime'].apply(lambda dt: dt.replace(day=1))
tracking_gas = df1.columns.values[1:]
# cut outliers!!!
if station == 'Plaza del Carmen':
df1 = df1[df1.SO2 <500 ]
df1 = df1[df1.CO < 10 ]
# homogenization (everything in µg/m^3)
for aGas in df1.columns.values[1:]:
unit = df_gas[df_gas.formula==aGas].unit_per_m3.values[0]
if ((unit == 'mg') or (unit == '10μg')):
# we change to 10
df1[aGas] = df1[aGas] * 100 #*1000 to get it in μg/m3 exactly
idx = df_gas[df_gas.formula==aGas].index
df_gas.loc[idx, 'unit_per_m3'] = '10μg'
# get all stats
df1_stats = df1.groupby(['datetime']).agg(['mean','std'])
df1_stats.columns = df1_stats.columns.to_flat_index()
df1_stats.columns = pd.Index([a+'_'+b for a,b in df1_stats.columns])
df1_stats['date'] = df1_stats.apply(get_month_year, axis=1)
for aGas in tracking_gas:
meanColumn = aGas+'_mean'
stdColumn = aGas+'_std'
df1_stats[aGas+'_upper'] = df1_stats[meanColumn]+df1_stats[aGas+'_std']
df1_stats[aGas+'_lower'] = df1_stats[meanColumn]-df1_stats[aGas+'_std']
return tracking_gas, df1_stats
def get_bokeh_viz_evolution_over_time(df1_stats, aText, tracking_gas):
# create annotations for time marks
startMC_span = Span(location=START_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
finesMC_span = Span(location=FINES_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
endMC_span = Span(location=END_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
cds_stats = ColumnDataSource(data=df1_stats)
p = figure(
x_axis_type="datetime",
width=950,
height=450,
title='Evolution of pollutant concentrations over time in '+aText,
y_axis_label='Gas Concentration',
x_axis_label='Date'
)
# add the data of each gas + interactive legend
lines, circles, bands = {}, {}, {}
items = []
for i, aGas in enumerate(tracking_gas):
unit = df_gas[df_gas.formula==aGas].unit_per_m3.values[0]
# add line of mean
lines[aGas] = p.line('datetime', aGas+'_mean', source=cds_stats, color = AIR_COLORS[i])
# add dots of mean
circles[aGas] = p.circle('datetime',aGas+'_mean', source=cds_stats, color=AIR_COLORS[i], size=5)
p.add_tools(HoverTool(tooltips=[
('Gas',aGas),
('Date', '@date'),
('Average value', f'@{aGas}_mean {unit}/m3'),
('Standard Deviation', f'@{aGas}_std {unit}/m3')
], renderers=[circles[aGas]]))
# add variance
bands[aGas] = p.varea(x='datetime', y1=aGas+'_upper', y2=aGas+'_lower', source=cds_stats, fill_alpha=0.1, fill_color=AIR_COLORS[i])
# append legend list
items.append((f'{aGas} ({unit}/m3)', [lines[aGas], circles[aGas], bands[aGas]]))
# add legend
legend = Legend(items=items, location='center')
legend.click_policy="hide"
legend.location = 'top_left'
p.add_layout(legend)
# add annotations to plot
p.add_layout(startMC_span)
p.add_layout(finesMC_span)
p.add_layout(endMC_span)
return layout(p, sizing_mode='stretch_both')
def get_bokeh_tabs_air_stations(aListOfStations):
tabs = []
for station in aListOfStations:
all_tracking_gas, df_stats_station = get_stats_dataframe(station)
p = get_bokeh_viz_evolution_over_time(df_stats_station, station, all_tracking_gas)
tabs.append(Panel(child=p, title=station))
return Tabs(tabs=tabs)
stations = ['Plaza del Carmen', "Plaza de España", "Castellana", "Retiro", "Méndez Álvaro"]
# get Bokeh visualisation with a tab for each required air quality station
tabs = get_bokeh_tabs_air_stations(stations)
if SAVE_PLOTS:
output_file("html_plots/air_quality_evolution_tabs.html")
save(tabs)
reset_output()
output_notebook()
else:
show(tabs)
Multi-line chart of the air quality progress relative to the same month of the previous year
To get a better idea of the air quality progress at a given location over time, let's plot, for each recorded pollutant, the change relative to the same month of the previous year.
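Before the actual implementation, a minimal sketch of the comparison on a toy monthly series (hypothetical values): each month is compared with the same month one year earlier.
import numpy as np
import pandas as pd

# 24 months of toy monthly mean concentrations (hypothetical values)
idx = pd.date_range('2018-01-01', periods=24, freq='MS')
monthly = pd.Series(np.linspace(50, 27, 24), index=idx)
# percentage change relative to the same month of the previous year
yoy = monthly.pct_change(periods=12) * 100
print(yoy.dropna().round(1).head())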
def get_df_comparison(station):
def get_progress_percent(aRow):
previousRow = df2_mean[(df2_mean.month==aRow.month) & (df2_mean.year==aRow.year-1)]
if len(previousRow) != 0:
previousRow = previousRow.iloc[0]
ratios = 100*(aRow - previousRow)/previousRow
datetime_to_keep = aRow.name
ratios.name = datetime_to_keep
ratios.month = datetime_to_keep.month
ratios.year = datetime_to_keep.year
return ratios
return None
df1 = df_air[(df_air.name == station)][['formula','value','datetime']]
df1 = df1.pivot(index='datetime', columns='formula', values='value').reset_index()
df1['datetime'] = df1.datetime.dt.floor('D')
df1['datetime'] = df1['datetime'].apply(lambda dt: dt.replace(day=1))
tracking_gas = df1.columns.values[1:]
# cut outliers!!!
if station == 'Plaza del Carmen':
df1 = df1[df1.SO2 <500 ]
df1 = df1[df1.CO < 10 ]
# get mean
df2_mean = df1.groupby(['datetime']).mean().reset_index()
df2_mean = df2_mean.set_index('datetime')
df2_mean["month"] = df2_mean.index.month
df2_mean["year"] = df2_mean.index.year
# get percentage
df2_ratios = df2_mean[df2_mean.year > 2016].apply(get_progress_percent, axis=1)
# get date display
df2_ratios['date'] = df2_ratios.apply(get_month_year, axis=1)
return tracking_gas, df2_ratios
def get_bokeh_viz_ratios(df2_ratios, aText, tracking_gas):
# create annotations for time marks
startMC_span = Span(location=START_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
finesMC_span = Span(location=FINES_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
endMC_span = Span(location=END_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
cds_ratios = ColumnDataSource(data=df2_ratios)
p = figure(
width=950,
height=450,
x_axis_type="datetime",
title='Comparisons of pollutant concentrations with the month of previous year in '+aText,
y_axis_label='+/- percentage based on the previous month',
x_axis_label='Date'
)
# add the data of each gas + interactive legend
items = []
lines, circles = {}, {}
for i, aGas in enumerate(tracking_gas):
unit = df_gas[df_gas.formula==aGas].unit_per_m3.values[0]
# add line
lines[aGas] = p.line('datetime', aGas, source=cds_ratios, color =AIR_COLORS[i])
# add dots
circles[aGas] = p.circle('datetime',aGas, source=cds_ratios, color=AIR_COLORS[i], size=5)
p.add_tools(HoverTool(tooltips=[
('Gas',aGas),
('Date', '@date'),
('Previous Month comparison','@'+aGas+'%')
], renderers=[circles[aGas]]))
# append legend item
items.append((f'{aGas} ({unit}/m3)', [lines[aGas], circles[aGas]]))
# add annotations to plot
p.add_layout(startMC_span)
p.add_layout(finesMC_span)
p.add_layout(endMC_span)
# add "zero" annotation
zero_line_span = Span(location=0, dimension='width', line_color='red', line_width=1, line_alpha=0.3)
p.add_layout(zero_line_span)
legend = Legend(items=items)
legend.click_policy="hide"
legend.location = 'top_left'
p.add_layout(legend)
p = layout(p, sizing_mode='stretch_both')
return p
def get_bokeh_viz_ratios_listStations(stations):
df_ratios = []
all_stations_gases = []
for aStation in stations:
all_tracking_gas, df_ratios_station = get_df_comparison(aStation)
df_ratios.append(df_ratios_station)
if len(all_stations_gases) == 0:
all_stations_gases = np.array(all_tracking_gas)
else:
all_stations_gases = np.intersect1d(all_stations_gases, np.array(all_tracking_gas))
df_ratios = pd.concat(df_ratios)
df_ratios.dropna(axis=1, how="any", inplace=True)
df_ratios = df_ratios.groupby('datetime').mean()
df_ratios['date'] = df_ratios.apply(get_month_year, axis=1)
return get_bokeh_viz_ratios(df_ratios, ', '.join(stations), all_stations_gases)
def get_bokeh_tabs_air_stations_for_ratios(aListOfStations, aListOfTabTitles=[]):
if len(aListOfTabTitles) == 0:
aListOfTabTitles = aListOfStations
tabs = []
for i in range (len(aListOfStations)):
tab_title, station = aListOfTabTitles[i], aListOfStations[i]
if isinstance(station, str):
# one station is provided
all_tracking_gas, df_stats_station = get_df_comparison(station)
p = get_bokeh_viz_ratios(df_stats_station, station, all_tracking_gas)
tabs.append(Panel(child=p, title=tab_title))
else:
p = get_bokeh_viz_ratios_listStations(station)
tabs.append(Panel(child=p, title=tab_title))
return Tabs(tabs=tabs)
stations = ['Plaza del Carmen', 'Castellana', 'Retiro']
tabs_comparison = get_bokeh_tabs_air_stations_for_ratios(stations)
if SAVE_PLOTS:
output_file("html_plots/air_quality_ratios_evolution_tabs.html")
save(tabs_comparison)
reset_output()
output_notebook()
else:
show(tabs_comparison)
Map of traffic points
We want to check first how many traffic points are in each district.
p = figure(title="Traffic measurement stations in Madrid", x_axis_type="mercator", y_axis_type="mercator",
height=700, width=800)
p.border_fill_color = None
# Madrid Central
source = ColumnDataSource(df_districts[df_districts["name"] == "Centro"])
p.line(x="utm_x", y="utm_y", color=MADRID_IN_OUT_COLORS[0], line_width=4,
source=source, legend_label="Madrid Central limit", muted=False, muted_alpha=0.3)
source = ColumnDataSource(traffic_points[traffic_points["district"] == "Centro"])
circles_in = p.circle(x="utm_x", y="utm_y", color=MADRID_IN_OUT_COLORS[0], line_width=1,
source=source, muted_alpha=0.3, size=5,
line_color=MADRID_IN_OUT_DARK_COLORS[0], legend_label="IN Madrid Central Area")
source = ColumnDataSource(traffic_points[traffic_points["district"] != "Centro"])
circles_out = p.triangle(x="utm_x", y="utm_y", color=MADRID_IN_OUT_COLORS[1], line_width=1,
source=source, muted_alpha=0.3, size=5,
line_color=MADRID_IN_OUT_DARK_COLORS[1], legend_label="OUT of Madrid Central Area")
# Hover tooltip
TOOLTIPS = [
("Name", "@name"),
("District", "@district")
]
p.add_tools(HoverTool(tooltips=TOOLTIPS, renderers=[circles_out, circles_in]))
cartodb = get_provider(CARTODBPOSITRON)
p.add_tile(cartodb)
# p.add_layout(p.legend[0], "right")
p.legend.click_policy = "mute"
p = layout(p, sizing_mode='scale_both')
if SAVE_PLOTS:
output_file("html_plots/traffic_points.html", title="Traffic stations")
save(p)
reset_output()
output_notebook()
else:
show(p)
traffic_points.groupby("district").agg(number_traffic_points=("idelem", "count")).sort_values("number_traffic_points").reset_index()
| | district | number_traffic_points |
|---|---|---|
| 0 | Barajas | 41 |
| 1 | Vicálvaro | 45 |
| 2 | Villa de Vallecas | 63 |
| 3 | Villaverde | 77 |
| 4 | Moratalaz | 114 |
| 5 | Usera | 121 |
| 6 | Chamberí | 134 |
| 7 | Retiro | 134 |
| 8 | Puente de Vallecas | 137 |
| 9 | Centro | 176 |
| 10 | Latina | 188 |
| 11 | San Blas - Canillejas | 193 |
| 12 | Salamanca | 199 |
| 13 | Hortaleza | 216 |
| 14 | Carabanchel | 220 |
| 15 | Tetuán | 226 |
| 16 | Arganzuela | 279 |
| 17 | Moncloa - Aravaca | 288 |
| 18 | Ciudad Lineal | 326 |
| 19 | Fuencarral - El Pardo | 332 |
| 20 | Chamartín | 357 |
From the table we can see that there are quite big differences in the number of traffic points per district, with Barajas having the fewest (41) and Chamartín the most (357).
This is something we have to take into account, as too few data points may lead to poor conclusions. Luckily, the Centro district has 176 measurement points, which we believe is enough given the small area (5.23 km²) of this district.
Now let's compare the traffic intensity in each district, to see which districts are busier.
df_traffic_points = traffic_points.copy()
df_traffic_points = pd.merge(df_traffic_points,
traffic_points_intensity_df.groupby("idelem").agg(intensity=("mean_intensity", "mean")).reset_index(),
on="idelem")
df_traffic_points.head()
| | idelem | tipo_elem | cod_cent | name | st_x | st_y | latitude | longitude | utm_x | utm_y | district | intensity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1044 | 494 | 03FT08PM01 | 03FT08PM01 | 438963.314635 | 4.474734e+06 | 40.421001 | -3.719488 | -414051.481782 | 4.927311e+06 | Moncloa - Aravaca | 635.772220 |
| 1 | 3600 | 494 | PM30901 | PM30901 | 443729.047369 | 4.473268e+06 | 40.408129 | -3.663184 | -407783.811885 | 4.925429e+06 | Moratalaz | 892.839662 |
| 2 | 3705 | 494 | PM41451 | PM41451 | 439858.261097 | 4.471574e+06 | 40.392598 | -3.708640 | -412843.963659 | 4.923159e+06 | Carabanchel | 1877.529488 |
| 3 | 6823 | 494 | PM41453 | PM41453 | 439188.095183 | 4.470895e+06 | 40.386431 | -3.716471 | -413715.710072 | 4.922257e+06 | Carabanchel | 1720.147974 |
| 4 | 7033 | 495 | 01015 | Pº Castellana S-N - Pl. Colon-Hermosilla | 441569.555897 | 4.475502e+06 | 40.428107 | -3.688839 | -410639.639249 | 4.928350e+06 | Salamanca | 1033.697074 |
p = figure(title="Average Intensity per District", x_axis_type="mercator", y_axis_type="mercator",
height=600, width=600)
p.toolbar.logo = None
# p.toolbar_location = None
patches = []
for name in district_name:
# Districts
source_dict = dict(utm_x = [[x for x in df_districts[df_districts["name"] == name]["utm_x"]]],
utm_y = [[y for y in df_districts[df_districts["name"] == name]["utm_y"]]],
name = [name],
intensity = [total_traffic_df[total_traffic_df["district"] == name]["mean_intensity"].mean()])
source = ColumnDataSource(source_dict)
patch = p.patches(xs="utm_x", ys="utm_y", color=linear_cmap("intensity", "Viridis256", 100, 700), line_width=3, alpha=0.4,
source=source, muted=False, muted_alpha=0.1)
patches.append(patch)
source = ColumnDataSource(df_traffic_points.groupby(["idelem", "utm_x", "utm_y", "name"]).mean("intensity").reset_index())
circles = p.circle(x="utm_x", y="utm_y", line_width=1, color=linear_cmap("intensity", "Viridis256", 100, 700),
source=source, alpha=0.5, size=5, legend_label="Traffic measurement points")
cartodb = get_provider(CARTODBPOSITRON)
p.add_tile(cartodb)
TOOLTIPS = [
("District Name", "@name"),
("Average Intensity", "@intensity"),
]
p.add_tools(HoverTool(tooltips=TOOLTIPS, renderers=patches))
TOOLTIPS = [
("Measurement station Name", "@name"),
("Average Intensity", "@intensity"),
]
p.add_tools(HoverTool(tooltips=TOOLTIPS, renderers=[circles]))
p.background_fill_color = None
p.border_fill_color = None
p.legend.click_policy="hide"
mapper = linear_cmap(field_name='intensity', palette="Viridis256", low=100, high=700)
color_bar = ColorBar(color_mapper=mapper['transform'], width=8)
p.add_layout(color_bar, 'right')
p = layout(p, sizing_mode='scale_both')
if SAVE_PLOTS:
output_file("html_plots/average_intensity_districts.html", title="Average Intensity Districts Map")
save(p)
reset_output()
output_notebook()
else:
show(p)
From this plot, we can see that Centro is not the busiest district. We do not know yet whether this is thanks to the measure or whether it is simply not a very busy district. What we do know for sure, and what may also be influenced by the measure, is that the surrounding districts, such as Arganzuela, Retiro or Moncloa, are among the busiest in the city, so it will be interesting to see whether there is a border effect caused by Madrid Central.
If we focus on the individual traffic points instead of the districts as a whole, we observe an expected result: traffic intensity on the main roads of Madrid is higher than on the secondary roads. We can see that, inside Madrid Central, the most crowded road is Gran Vía, the main commercial street of the district. We also observe a lot of traffic around Madrid Central, and again it will be interesting to investigate whether this is the normal traffic flow or a consequence of the measure.
Averages alone are not enough for a time series analysis, so we will now look at how the traffic intensity per district evolves day by day, from 2016 to February 2020, in order to follow the evolution of traffic over time.
# Plot traffic intensity through time by district
p = figure(title="Traffic intensity through time by district", x_axis_label="Date",
y_axis_label="Traffic intensity", height=200, width=400)
fig_lines = []
for name, color in zip(district_name, DISTRICT_COLORS):
source = ColumnDataSource(total_traffic_df[total_traffic_df["district"] == name])
l = p.line(x="date", y="mean_intensity", source=source,
color=color, legend_label=name, visible=True,
line_width=3, alpha=0.8)
fig_lines.append(l)
p.renderers.extend(fig_lines)
p.add_layout(p.legend[0], "right")
p.legend.click_policy = "hide"
p.xaxis.formatter=DatetimeTickFormatter(
days=['%a %d/%m/%Y'],
months=['%b %Y'],
years = ['%Y']
)
# Hover tooltip
TOOLTIPS = [
("District", "@district"),
("Intensity", "@mean_intensity"),
("Day", "@day_of_week @day/@month/@year")
]
p.add_tools(HoverTool(tooltips=TOOLTIPS, mode="vline"))
# Button
button = Button(
label="Switch all lines visibility", button_type="success", max_width=100, max_height=50
)
callback = CustomJS(args=dict(lines=fig_lines),
code="""
for(var i=0; i<lines.length; i++){
lines[i].visible = !lines[i].visible;
}
"""
)
button.js_on_click(callback)
p = layout(p, button, sizing_mode="scale_both")
if SAVE_PLOTS:
output_file("html_plots/traffict_intensity_time_district.html", title="Traffic intensity through time by district")
save(p)
reset_output()
output_notebook()
else:
show(p)
Given that we have 21 districts, the plot becomes a bit overwhelming. But if we show only the district(s) we want to see or compare, we can follow the evolution of the traffic intensity over the days. We can see that Centro is not the district with the most traffic, and that almost all districts follow the same tendencies: when a day has less traffic, the decrease appears in all districts, in proportion to each district's own intensity.
We suspect that the days when traffic increases or decreases across all districts are public holidays or days when something is celebrated. Let's try to find out if those days are related to any specific dates.
# Festivities of Madrid
festive_days = [
[1, 1, "New Year"],
[1, 6, "Epiphany of the Lord"],
[5, 1, "Labor Day"],
[5, 2, "Day of the Community of Madrid"],
[5, 15, "San Isidro"],
[7, 25, "Santiago Apóstol"],
[8, 6, "Fiestas de San Cayetano"],
[8, 15, "Asunción de la Virgen"],
[10, 12, "National Day of Spain"],
[11, 1, "All Saints' Day"],
[11, 9, "Día de la Almudena"],
[12, 6, "Day of the Constitution"],
[12, 8, "Inmaculada Concepción"],
[12, 25, "Christmas"]
]
months_name = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
festivities = pd.DataFrame(festive_days, columns=["month", "day", "festivity"])
festivities["month_name"] = festivities.apply(lambda x: months_name[int(x["month"]-1)], axis=1)
festivities["date"] = festivities["day"].astype(str) + " of " + festivities["month_name"].astype(str)
festivities
| | month | day | festivity | month_name | date |
|---|---|---|---|---|---|
| 0 | 1 | 1 | New Year | January | 1 of January |
| 1 | 1 | 6 | Epiphany of the Lord | January | 6 of January |
| 2 | 5 | 1 | Labor Day | May | 1 of May |
| 3 | 5 | 2 | Day of the Community of Madrid | May | 2 of May |
| 4 | 5 | 15 | San Isidro | May | 15 of May |
| 5 | 7 | 25 | Santiago Apóstol | July | 25 of July |
| 6 | 8 | 6 | Fiestas de San Cayetano | August | 6 of August |
| 7 | 8 | 15 | Asunción de la Virgen | August | 15 of August |
| 8 | 10 | 12 | National Day of Spain | October | 12 of October |
| 9 | 11 | 1 | All Saints' Day | November | 1 of November |
| 10 | 11 | 9 | Día de la Almudena | November | 9 of November |
| 11 | 12 | 6 | Day of the Constitution | December | 6 of December |
| 12 | 12 | 8 | Inmaculada Concepción | December | 8 of December |
| 13 | 12 | 25 | Christmas | December | 25 of December |
# Only Centro district
df_centro = total_traffic_df[total_traffic_df["district"] == "Centro"].reset_index(drop=True)
df_centro
| | district | date | day_of_week | day | month | year | mean_intensity |
|---|---|---|---|---|---|---|---|
| 0 | Centro | 2016-01-01 | Friday | 1 | 1 | 2016 | 309.436629 |
| 1 | Centro | 2016-01-02 | Saturday | 2 | 1 | 2016 | 361.187308 |
| 2 | Centro | 2016-01-03 | Sunday | 3 | 1 | 2016 | 373.553736 |
| 3 | Centro | 2016-01-04 | Monday | 4 | 1 | 2016 | 421.640004 |
| 4 | Centro | 2016-01-05 | Tuesday | 5 | 1 | 2016 | 388.572838 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1514 | Centro | 2020-02-25 | Tuesday | 25 | 2 | 2020 | 405.391449 |
| 1515 | Centro | 2020-02-26 | Wednesday | 26 | 2 | 2020 | 417.048690 |
| 1516 | Centro | 2020-02-27 | Thursday | 27 | 2 | 2020 | 431.823006 |
| 1517 | Centro | 2020-02-28 | Friday | 28 | 2 | 2020 | 432.899723 |
| 1518 | Centro | 2020-02-29 | Saturday | 29 | 2 | 2020 | 362.165472 |
1519 rows × 7 columns
# Centro
df_centro_year = df_centro.groupby(["year", "month", "day"]).agg(mean_intensity_year = ("mean_intensity", "mean")).reset_index()
df_centro_year["date"] = pd.to_datetime(df_centro_year[["year", "month", "day"]])
df_centro_year = pd.merge(df_centro_year, festivities[["month", "day", "festivity"]], on=["month", "day"], how="left")
df_centro_year.loc[df_centro_year["festivity"].isna(), "festivity"] = "None"
# All districts
df_year = total_traffic_df.groupby(["year", "month", "day"]).agg(mean_intensity_year = ("mean_intensity", "mean")).reset_index()
df_year["date"] = pd.to_datetime(df_year[["year", "month", "day"]])
df_year = pd.merge(df_year, festivities[["month", "day", "festivity"]], on=["month", "day"], how="left")
df_year.loc[df_year["festivity"].isna(), "festivity"] = "None"
display(df_centro_year)
df_year
| | year | month | day | mean_intensity_year | date | festivity |
|---|---|---|---|---|---|---|
| 0 | 2016 | 1 | 1 | 309.436629 | 2016-01-01 | New Year |
| 1 | 2016 | 1 | 2 | 361.187308 | 2016-01-02 | None |
| 2 | 2016 | 1 | 3 | 373.553736 | 2016-01-03 | None |
| 3 | 2016 | 1 | 4 | 421.640004 | 2016-01-04 | None |
| 4 | 2016 | 1 | 5 | 388.572838 | 2016-01-05 | None |
| ... | ... | ... | ... | ... | ... | ... |
| 1514 | 2020 | 2 | 25 | 405.391449 | 2020-02-25 | None |
| 1515 | 2020 | 2 | 26 | 417.048690 | 2020-02-26 | None |
| 1516 | 2020 | 2 | 27 | 431.823006 | 2020-02-27 | None |
| 1517 | 2020 | 2 | 28 | 432.899723 | 2020-02-28 | None |
| 1518 | 2020 | 2 | 29 | 362.165472 | 2020-02-29 | None |
1519 rows × 6 columns
| | year | month | day | mean_intensity_year | date | festivity |
|---|---|---|---|---|---|---|
| 0 | 2016 | 1 | 1 | 255.887737 | 2016-01-01 | New Year |
| 1 | 2016 | 1 | 2 | 288.450196 | 2016-01-02 | None |
| 2 | 2016 | 1 | 3 | 261.173685 | 2016-01-03 | None |
| 3 | 2016 | 1 | 4 | 347.221850 | 2016-01-04 | None |
| 4 | 2016 | 1 | 5 | 342.530852 | 2016-01-05 | None |
| ... | ... | ... | ... | ... | ... | ... |
| 1514 | 2020 | 2 | 25 | 448.332289 | 2020-02-25 | None |
| 1515 | 2020 | 2 | 26 | 458.179619 | 2020-02-26 | None |
| 1516 | 2020 | 2 | 27 | 467.711812 | 2020-02-27 | None |
| 1517 | 2020 | 2 | 28 | 453.818640 | 2020-02-28 | None |
| 1518 | 2020 | 2 | 29 | 336.626215 | 2020-02-29 | None |
1519 rows × 6 columns
# Create annotations for time marks
startMC_span = Span(location=START_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
finesMC_span = Span(location=FINES_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
endMC_span = Span(location=END_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
# Plot
p = figure(title="Average traffic intensity per day through the years in Madrid Central and in the city of Madrid", x_axis_label="Year",
y_axis_label="Traffic intensity average by day", width=1100)
source = ColumnDataSource(df_centro_year)
p.vbar(x="date", top="mean_intensity_year", source=source, width=0.99, alpha=0.2,
line_alpha=1, line_color=DISTRICT_DARK_COLORS[0],
legend_label="Madrid Central", color=DISTRICT_COLORS[0])
source = ColumnDataSource(df_centro_year[df_centro_year["festivity"] != "None"])
p.vbar(x="date", top="mean_intensity_year", source=source, width=1,
line_color=DISTRICT_DARK_COLORS[3],
legend_label="Festive days in Madrid", color=DISTRICT_COLORS[3])
source = ColumnDataSource(df_year)
p.line(x="date", y="mean_intensity_year", source=source, line_width=3, alpha=0.6,
legend_label="Madrid", color=DISTRICT_COLORS[1])
p.xaxis.formatter=DatetimeTickFormatter(
days=['%a %d/%m/%Y'],
months=['%b %Y'],
years = ['%Y']
)
# add annotations to plot
p.add_layout(startMC_span)
p.add_layout(finesMC_span)
p.add_layout(endMC_span)
# Hover tooltip
TOOLTIPS = [
("Average Intensity per day", "@mean_intensity_year"),
("Date", "@day/@month/@year"),
("Festivity", "@festivity")
]
p.add_tools(HoverTool(tooltips=TOOLTIPS, mode="vline"))
p.add_layout(p.legend[0], "right")
if SAVE_PLOTS:
output_file("html_plots/traffict_intensity_time_in_out.html", title="Traffic intensity through time in and out Madrid Central")
save(p)
reset_output()
output_notebook()
else:
show(p)
Just as we expected, almost all the peaks and valleys coincide with festive days in Madrid, which gives us a good overview of the traffic behavior. We also observe that the Madrid Central area (Centro district) follows roughly the same average as the whole city of Madrid.
The gray dashed lines represent the 3 key dates (start of Madrid Central, start of fines, end of Madrid Central).
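As a quick sanity check of this reading, we can compare the average daily intensity on festive and non-festive days. This is a minimal sketch (not part of the original pipeline), assuming the df_centro_year dataframe built above with its mean_intensity_year and festivity columns:
# Sketch: average daily traffic intensity in Centro on festive vs. non-festive days
is_festive = df_centro_year["festivity"] != "None"
print("Festive days:    ", round(df_centro_year.loc[is_festive, "mean_intensity_year"].mean(), 2))
print("Non-festive days:", round(df_centro_year.loc[~is_festive, "mean_intensity_year"].mean(), 2))
If the reading above holds, the festive-day average should come out clearly lower.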
Now that we have seen the whole picture of the traffic intensity evolution over the period we are investigating, we want to focus on yearly, monthly and weekly comparisons between the Madrid Central area and the whole city of Madrid.
# Yearly analysis
df_centro_year = df_centro.groupby("year").agg(mean_intensity_year = ("mean_intensity", "mean")).reset_index()
df_year = total_traffic_df.groupby("year").agg(mean_intensity_year = ("mean_intensity", "mean")).reset_index()
display(df_centro_year)
df_year
| year | mean_intensity_year | |
|---|---|---|
| 0 | 2016 | 424.935031 |
| 1 | 2017 | 423.296288 |
| 2 | 2018 | 395.115073 |
| 3 | 2019 | 387.492076 |
| 4 | 2020 | 385.513792 |
| year | mean_intensity_year | |
|---|---|---|
| 0 | 2016 | 402.843768 |
| 1 | 2017 | 395.590693 |
| 2 | 2018 | 400.240504 |
| 3 | 2019 | 397.607394 |
| 4 | 2020 | 409.064831 |
# Plot yearly analysis
p = figure(title="Average traffic intensity per day through the years in Madrid Central and in the city of Madrid", x_axis_label="Year",
y_axis_label="Traffic intensity average by day", height=400, width=800)
source = ColumnDataSource(df_centro_year)
p.vbar(x="year", top="mean_intensity_year", source=source, width=0.5, legend_label="Madrid Central", color=DISTRICT_COLORS[0])
source = ColumnDataSource(df_year)
p.line(x="year", y="mean_intensity_year", source=source, line_width=3, legend_label="Madrid", color=DISTRICT_COLORS[1])
# Hover tooltip
TOOLTIPS = [
("Average Intensity per day", "@mean_intensity_year"),
("Year", "@year")
]
p.add_tools(HoverTool(tooltips=TOOLTIPS, mode="vline"))
p.add_layout(p.legend[0], "right")
if SAVE_PLOTS:
output_file("html_plots/yearly_analysis_in_out.html", title="Average traffic intensity per day through the years in and out of Madrid Central")
save(p)
reset_output()
output_notebook()
else:
show(p)
From this first plot we see that in 2016 and 2017 the average daily traffic intensity was higher in Madrid Central than in the rest of the city, but this changed in 2018. From 2018 onwards, the Madrid Central average decreases and drops below the city-wide average.
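To put a number on this crossover, a small sketch (not part of the original analysis) can merge the two yearly aggregates computed just above, df_centro_year and df_year, and take their difference per year:
import pandas as pd
# Sketch: yearly gap between the Centro average and the city-wide average
gap = pd.merge(df_centro_year, df_year, on="year", suffixes=("_centro", "_madrid"))
gap["centro_minus_madrid"] = gap["mean_intensity_year_centro"] - gap["mean_intensity_year_madrid"]
gap[["year", "centro_minus_madrid"]]
Positive values mean Centro is busier than the city-wide average; the sign flip around 2018 matches the crossover described above.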
# Monthly analysis
months_name = ["Jan", "Feb", "Mar", "Apr", "May", "June", "July", "Aug", "Sept", "Oct", "Nov", "Dec"]
df_centro_month = df_centro.groupby("month").agg(mean_intensity_month = ("mean_intensity","mean")).reset_index()
df_month = total_traffic_df.groupby("month").agg(mean_intensity_month = ("mean_intensity", "mean")).reset_index()
df_centro_month["month_name"] = df_centro_month.apply(lambda x: months_name[int(x["month"]-1)], axis=1)
df_month["month_name"] = df_month.apply(lambda x: months_name[int(x["month"]-1)], axis=1)
display(df_centro_month)
df_month
| month | mean_intensity_month | month_name | |
|---|---|---|---|
| 0 | 1 | 390.225333 | Jan |
| 1 | 2 | 422.934982 | Feb |
| 2 | 3 | 423.078660 | Mar |
| 3 | 4 | 417.780030 | Apr |
| 4 | 5 | 417.368505 | May |
| 5 | 6 | 429.546765 | June |
| 6 | 7 | 390.347374 | July |
| 7 | 8 | 324.546668 | Aug |
| 8 | 9 | 419.914969 | Sept |
| 9 | 10 | 425.815269 | Oct |
| 10 | 11 | 420.092164 | Nov |
| 11 | 12 | 403.422499 | Dec |
| month | mean_intensity_month | month_name | |
|---|---|---|---|
| 0 | 1 | 393.485793 | Jan |
| 1 | 2 | 417.557986 | Feb |
| 2 | 3 | 409.181076 | Mar |
| 3 | 4 | 408.456585 | Apr |
| 4 | 5 | 415.059586 | May |
| 5 | 6 | 428.255877 | June |
| 6 | 7 | 375.828154 | July |
| 7 | 8 | 291.408997 | Aug |
| 8 | 9 | 409.447912 | Sept |
| 9 | 10 | 422.270867 | Oct |
| 10 | 11 | 418.198544 | Nov |
| 11 | 12 | 404.755192 | Dec |
# Plot monthly analysis
p = figure(title="Average traffic intensity per day in a month in Madrid Central and in the city of Madrid", x_axis_label="Month",
y_axis_label="Traffic intensity average by day", height=400, width=800, x_range=months_name)
source = ColumnDataSource(df_centro_month)
p.vbar(x="month_name", top="mean_intensity_month", source=source, width=0.5, legend_label="Madrid Central", color=DISTRICT_COLORS[0])
source = ColumnDataSource(df_month)
p.line(x="month_name", y="mean_intensity_month", source=source, line_width=3, legend_label="Madrid", color=DISTRICT_COLORS[1])
# Hover tooltip
TOOLTIPS = [
("Average Intensity per day", "@mean_intensity_month"),
("Month", "@month_name")
]
p.add_tools(HoverTool(tooltips=TOOLTIPS, mode="vline"))
p.add_layout(p.legend[0], "right")
if SAVE_PLOTS:
output_file("html_plots/monthly_analysis_in_out.html", title="Average traffic intensity per day in a month in and out of Madrid Central")
save(p)
reset_output()
output_notebook()
else:
show(p)
Regarding the average daily traffic intensity per month, we see that it is more or less the same for Madrid Central and the city. It is still worth noting that for some months the Madrid Central average is a bit higher, for example during the summer holidays (June, July, August, September) or the Easter holidays (March-April depending on the year), which could be related to more tourists visiting the city center (the Madrid Central area).
# Weekly analysis
week_days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df_centro_week = df_centro.groupby("day_of_week").agg(mean_intensity_week = ("mean_intensity", "mean")).reset_index()
df_week = total_traffic_df.groupby("day_of_week").agg(mean_intensity_week = ("mean_intensity", "mean")).reset_index()
df_centro_week["day_of_week"] = df_centro_week["day_of_week"].astype("category")
df_centro_week["day_of_week"] = df_centro_week["day_of_week"].cat.set_categories(week_days)
df_centro_week = df_centro_week.sort_values("day_of_week")
df_week["day_of_week"] = df_week["day_of_week"].astype("category")
df_week["day_of_week"] = df_week["day_of_week"].cat.set_categories(week_days)
df_week = df_week.sort_values("day_of_week")
display(df_centro_week)
df_week
| day_of_week | mean_intensity_week | |
|---|---|---|
| 1 | Monday | 404.781332 |
| 5 | Tuesday | 414.155558 |
| 6 | Wednesday | 423.733646 |
| 4 | Thursday | 434.114738 |
| 0 | Friday | 445.293051 |
| 2 | Saturday | 373.782265 |
| 3 | Sunday | 351.353999 |
| day_of_week | mean_intensity_week | |
|---|---|---|
| 1 | Monday | 419.675982 |
| 5 | Tuesday | 429.313752 |
| 6 | Wednesday | 435.721548 |
| 4 | Thursday | 442.083065 |
| 0 | Friday | 444.174894 |
| 2 | Saturday | 327.552783 |
| 3 | Sunday | 297.044891 |
# Plot weekly analysis
p = figure(title="Average traffic intensity per day in a week in Madrid Central and in the city of Madrid", x_axis_label="Weekday",
y_axis_label="Traffic intensity average by day", height=400, width=800, x_range=week_days)
source = ColumnDataSource(df_centro_week)
p.vbar(x="day_of_week", top="mean_intensity_week", source=source, width=0.5, legend_label="Madrid Central", color=DISTRICT_COLORS[0])
source = ColumnDataSource(df_week)
p.line(x="day_of_week", y="mean_intensity_week", source=source, line_width=3, legend_label="Madrid", color=DISTRICT_COLORS[1])
# Hover tooltip
TOOLTIPS = [
("Average Intensity per day", "@mean_intensity_week"),
("Day of the Week", "@day_of_week")
]
p.add_tools(HoverTool(tooltips=TOOLTIPS, mode="vline"))
p.add_layout(p.legend[0], "right")
if SAVE_PLOTS:
output_file("html_plots/weekly_analysis_in_out.html", title="Average traffic intensity per day in a week in and out of Madrid Central")
save(p)
reset_output()
output_notebook()
else:
show(p)
Lastly, the weekly plot shows that Madrid Central has a higher average daily traffic intensity during the weekends than the rest of the city, and that in both cases the average increases progressively over the working week.
Now that we have completed our exploratory analysis, it is important to draw conclusions through a more in-depth analysis of the relevant information we have obtained.
It would be interesting to see what is happening around Madrid Central and far from Madrid Central.
First, let's choose some stations so that we can consider their average.
station = 'Plaza del Carmen'
outside_stations = list(df_stations[df_stations.is_in_MC == False ]['name'].unique())
around_stations = ['Plaza de España', 'Castellana','Retiro','Méndez Álvaro']
far_stations = ['El Pardo', 'Barajas Pueblo', 'Villaverde', 'Ensanche de Vallecas']
df_stations['is_around_MC'] = df_stations.name.isin(around_stations)
df_stations['is_far_MC'] = df_stations.name.isin(far_stations)
# plot map
p = figure(title="Air quality stations in Madrid", x_axis_type="mercator", y_axis_type="mercator")
# p.background_fill_color = None
source_in = ColumnDataSource(df_stations[df_stations.is_in_MC==True])
source_far = ColumnDataSource(df_stations[df_stations.is_far_MC==True])
source_around = ColumnDataSource(df_stations[df_stations.is_around_MC==True])
source_out = ColumnDataSource(df_stations[(df_stations.is_in_MC==False)&(df_stations.is_far_MC==False)&(df_stations.is_around_MC==False)])
cr_in = p.circle(x="utm_x", y="utm_y", size=10, color=MADRID_IN_OUT_COLORS[0], source=source_in)
cr_out = p.triangle(x="utm_x", y="utm_y", size=10, color=MADRID_IN_OUT_COLORS[1], source=source_out)
cr_far = p.hex(x="utm_x", y="utm_y", size=10, color=DISTRICT_COLORS[2], source=source_far)
cr_around = p.diamond(x="utm_x", y="utm_y", size=10, color=DISTRICT_COLORS[4], source=source_around)
# Madrid Central lines
source_mc = ColumnDataSource(cm_points)
line_cm = p.line(x="utm_x", y="utm_y", color=MADRID_IN_OUT_COLORS[0], line_width=2,
source=source_mc, muted=False, muted_alpha=0.3)
cartodb = get_provider(CARTODBPOSITRON)
p.add_tile(cartodb)
p.border_fill_color = None
p.add_tools(HoverTool(tooltips=[('Name', '@name'),('Monitored Pollutants','@gas_list')], renderers=[cr_in, cr_out, cr_around, cr_far]))
# add interactive legend
legend = Legend(items=[
('Madrid Central Area', [line_cm]),
('IN Madrid Central area', [cr_in]),
('OUT of Madrid Central area', [cr_out]),
('OUT OF and FAR FROM Madrid Central area', [cr_far]),
('OUT OF but AROUND Madrid Central area', [cr_around])
], location='center')
legend.click_policy="hide"
legend.location = "top_right"
legend.background_fill_alpha = 0.5
p.add_layout(legend)
p = layout(p, sizing_mode='scale_both')
if SAVE_PLOTS:
output_file("html_plots/air_quality_stations_some.html")
save(p)
reset_output()
output_notebook()
else:
show(p)
tabs_titles = ['In Madrid Central', 'Outside Madrid Central', 'Around Madrid Central', 'Far from Madrid Central']
tabs_comparison = get_bokeh_tabs_air_stations_for_ratios([station, outside_stations, around_stations, far_stations], tabs_titles)
if SAVE_PLOTS:
output_file("html_plots/air_quality_ratios_evolution_tabs.html")
save(tabs_comparison)
reset_output()
output_notebook()
else:
show(tabs_comparison)
Air quality stations in Madrid monitor various gas concentrations every hour, which we can compare in the following interactive line chart.
A positive percentage refers to a gas concentration increase in comparison with the same month of the previous year. In contrast, a negative percentage refers to a gas concentration drop in comparison with the same month of the previous year, which is the aim of Madrid Central.
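The tabbed chart is produced by the get_bokeh_tabs_air_stations_for_ratios helper called above. As a hedged illustration of the underlying computation only, a minimal sketch (with a hypothetical function name, assuming df_air carries the formula, name, datetime and value columns used later in this section) could look like this:
import pandas as pd
def monthly_yoy_change(df_air, gas="NO2", station="Plaza del Carmen"):
    # monthly mean concentration of one gas at one station
    subset = df_air[(df_air.formula == gas) & (df_air.name == station)]
    monthly = subset.set_index("datetime")["value"].resample("M").mean()
    # percentage change with respect to the same month of the previous year
    return (monthly.pct_change(periods=12) * 100).dropna()
monthly_yoy_change(df_air).tail()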
Inside the low-emission zone of Madrid Central (first tab), the vehicle regulation leads to a decrease, or a slower increase, of Nitrogen Oxides (NOx = NO2 + NO), Ozone (O3) and Sulfur Dioxide (SO2). These gases are three common pollutants emitted by cars and are heavily involved in air pollution. Nonetheless, air pollution outside Madrid Central - both near and far from the low-emission zone borders - only starts to decrease once fines are enforced.
To get a better visualisation of the evolution over time and space, let's build an animated map.
def get_air_quality_df_for_map(selected_air_formula):
colors = {'neutral':'grey', 'under_standard':'#53c688', 'above_standard':'red'}
# filter by air type
df_map = df_air[df_air.formula == selected_air_formula]
# consider useful columns only
df_map = df_map[['value', 'datetime', 'name', 'longitude', 'latitude']]
# create year and month column
df_map['month'], df_map['year'] = df_map.datetime.dt.month, df_map.datetime.dt.year
# group by station name, year then month and get mean value
df_map = df_map.groupby(['name','year','month']).mean().reset_index()
# create an index, based on Month and Year
df_map['day'] = 1
df_map['date'] = pd.to_datetime(df_map[['month','year','day']])
# change color based on European standard if known
gas_row_index = df_gas[df_gas.formula == selected_air_formula].index[0]
european_limit = df_gas.loc[gas_row_index, 'european_standard']
if european_limit != 'None':
european_limit = int(european_limit)
if (df_gas.loc[gas_row_index, 'unit_per_m3'] != df_gas.loc[gas_row_index, 'unit_per_m3_original']):
# adjust based on a previous homogenization of the data
european_limit = european_limit*100
df_map['fillColor'] = colors['under_standard']
df_map.loc[df_map[df_map.value >= european_limit].index, 'fillColor'] = colors['above_standard']
else:
# else: fill in with the basic color
df_map['fillColor'] = colors['neutral']
# fill in the circle radius based on 'value'
min_value, max_value = min(df_map.value), max(df_map.value)
radius=20
df_map['radius'] = radius*(df_map['value']-min_value)/(max_value-min_value)
return df_map
def create_geojson_features(df):
"""
Create GeoJSON features to get an animated map for spatial timed-data
(code source: https://www.linkedin.com/pulse/visualizing-nyc-bike-data-interactive-animated-maps-folium-toso/)
"""
features = []
for _, row in df.iterrows():
feature = {
'type': 'Feature',
'geometry': {
'type':'Point',
'coordinates':[row['longitude'],row['latitude']]
},
'properties': {
'time': row['date'].__str__(),
'style': {'color' : ''},
'icon': 'circle',
'iconstyle':{
'fillColor': row['fillColor'],
'fillOpacity': 0.8,
'stroke': 'true',
'radius': row['radius']
}
}
}
features.append(feature)
return features
def get_air_quality_animated_map(selected_air_formula):
df_map = get_air_quality_df_for_map(selected_air_formula)
air_geojson = create_geojson_features(df_map)
# set Madrid latitude and longitude
madrid_lat, madrid_long = 40.416775, -3.703790
# create Folium map background centered on Madrid
madrid_map = folium.Map(location = [madrid_lat, madrid_long],
tiles = "CartoDB Positron",
zoom_start = 12)
# create animated map over time
TimestampedGeoJson(air_geojson,
period = 'P1M',
duration = 'P1M',
date_options='MM/YYYY',
transition_time = 400,
loop_button = True,
auto_play = True).add_to(madrid_map)
# make it interactive: get popup on click with infos
for i, aRow in df_stations.iterrows():
popup_text = f"""
<p><b>{aRow['name']}</b></p>
<p>{'IN' if aRow['is_in_MC'] else 'OUT of '} Madrid Central</p>
"""
folium.Marker(
[aRow.latitude, aRow.longitude],
icon=folium.DivIcon(html=""),
popup = popup_text
).add_to(madrid_map)
return madrid_map
pollutant_to_plot = 'NO2'
get_air_quality_animated_map(pollutant_to_plot)
We can see that Madrid experienced periods of high pollution before the implementation of Madrid Central, e.g. from 2016 to 2018. Then, after a transition of a few months, NO2 pollution remains below the European standard thanks to Madrid Central. Eventually, air pollution becomes critical again as soon as Madrid Central ends.
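To attach a rough number to this reading, a sketch (not part of the original pipeline) can count how often the monthly mean NO2 at a station exceeds the European limit of 40 µg/m3 (the standard referenced further below), in each period. It assumes df_air with values homogenized to µg/m3, plus the START_DATE and END_DATE constants used later in the notebook:
import pandas as pd
# Sketch: share of station-months whose mean NO2 exceeds 40 µg/m3, per period
no2 = df_air[df_air.formula == "NO2"]
monthly = (no2.groupby(["name", pd.Grouper(key="datetime", freq="M")])["value"]
              .mean().reset_index())
periods = [
    ("before MC", monthly["datetime"] < START_DATE),
    ("during MC", (monthly["datetime"] >= START_DATE) & (monthly["datetime"] < END_DATE)),
    ("after MC",  monthly["datetime"] >= END_DATE),
]
for label, mask in periods:
    share = (monthly.loc[mask, "value"] > 40).mean() * 100
    print(f"{label}: {share:.1f}% of station-months above the 40 µg/m3 limit")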
Let's generate a final multi-bar chart to inspect the differences in time (before MC, after the inauguration, after fines enforcement, and after the end of MC) and space (inside the MC area, around it, and far from it).
def get_data_multibars_and_boxplots(selected_gas):
def compute_air_quality_data_and_append(aDataFrame):
all_values = []
# before MC
all_values.append(aDataFrame[aDataFrame.datetime < START_DATE].value.values)
# all MC period
all_values.append(aDataFrame[(aDataFrame.datetime >= START_DATE) & (aDataFrame.datetime < END_DATE)].value.values)
# after inauguration
all_values.append(aDataFrame[(aDataFrame.datetime >= START_DATE) & (aDataFrame.datetime < FINES_DATE)].value.values)
# after fines
all_values.append(aDataFrame[(aDataFrame.datetime >= FINES_DATE) & (aDataFrame.datetime < END_DATE)].value.values)
# after end
all_values.append(aDataFrame[aDataFrame.datetime >= END_DATE].value.values)
# append to data multibars
for i in range (len(all_values)):
if len(all_values[i])!=0:
data_multibars[time_categories[i]].append(np.mean(all_values[i]))
else:
data_multibars[time_categories[i]].append(0)
data_multibars = {'location': space_categories}
for aTime in time_categories:
data_multibars[aTime] = []
# filter with selected gas
df_selected_gas = df_air[df_air.formula == selected_gas]
# Location: INSIDE MC
df_inside = df_selected_gas[df_selected_gas['name'] == 'Plaza del Carmen']
compute_air_quality_data_and_append(df_inside)
# Location: OUTSIDE MC
df_outside = df_selected_gas[df_selected_gas['name'] != 'Plaza del Carmen']
compute_air_quality_data_and_append(df_outside)
# Location: AROUND MC
df_around = df_selected_gas[df_selected_gas['name'].isin(around_stations)]
compute_air_quality_data_and_append(df_around)
# Location: FAR FROM MC
df_far = df_selected_gas[df_selected_gas['name'].isin(far_stations)]
compute_air_quality_data_and_append(df_far)
return data_multibars
def get_multibars_viz(data_multibars, selected_gas):
x = [ (period, location) for period in time_categories for location in space_categories]
averages = []
for time_category in time_categories:
averages.extend(data_multibars[time_category])
source = ColumnDataSource(data=dict(x=x, averages=averages))
p = figure(
x_range=FactorRange(*x),
height = 500,
width = 900,
title=f"{selected_gas} average concentration per hour over the MC process",
toolbar_location=None,
tools="hover", tooltips="@averages",
y_axis_label = f"gas concentration in {df_gas[df_gas.formula == selected_gas]['unit_per_m3_original'].values[0]}/m3"
)
p.vbar(x='x', top='averages', width=0.9, source=source, line_color=None,
fill_color=factor_cmap('x', palette=location_based_color_palette, factors=space_categories, start=1, end=4))
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1.3
p.xgrid.grid_line_color = None
p = layout(p, sizing_mode='scale_both')
return p
def get_multibars_viz_for_gas(selected_gas):
multibars = get_data_multibars_and_boxplots(selected_gas)
return get_multibars_viz(multibars, selected_gas)
def get_bokeh_tabs_stats_over_process(aListOfGas):
tabs = []
for aGas in aListOfGas:
p = get_multibars_viz_for_gas(aGas)
tabs.append(Panel(child=p, title=aGas))
return Tabs(tabs=tabs)
space_categories = ['Inside MC area','Outside MC area','Outside but Around MC area','Outside and Far from MC area']
time_categories = ['Before MC', 'During all MC period', 'After MC Inauguration', 'After fines enforcement', 'After MC ending']
location_based_color_palette = [MADRID_IN_OUT_COLORS[0], DISTRICT_COLORS[4], DISTRICT_COLORS[2], MADRID_IN_OUT_COLORS[1]]
all_inspected_gas = ['NO2','SO2','CO', 'O3']
tabs_gas = get_bokeh_tabs_stats_over_process(all_inspected_gas)
if SAVE_PLOTS:
output_file("html_plots/air_quality_multibars_tabs.html")
save(tabs_gas)
reset_output()
output_notebook()
else:
show(tabs_gas)
First, this multi-bar chart allows us to reproduce the results of a previous study conducted on the same subject.
Indeed, Enrique Galdon-Sanchez et al. state that "the concentration of nitrogen dioxide (NO2), a harmful pollutant, decreased by 18.6% in the Madrid Central area". We find the same drop, with the NO2 concentration going from 46.36 µg/m3 to 37.71 µg/m3. Since the European standard for NO2 concentration is 40 µg/m3, the Madrid Central operation actually helped the city stop exceeding the limit.
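The quoted drop can be verified with a one-line computation from the two averages above:
# Relative NO2 decrease inside Madrid Central, from the averages quoted above (in µg/m3)
before, during = 46.36, 37.71
print(f"Relative decrease: {(before - during) / before * 100:.2f}%")  # ~18.66%, in line with the reported 18.6%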
According to the same research paper, there is almost no border effect: NO2 air pollution outside Madrid Central increased by only around 3% during the vehicle regulation. We reach the same conclusion.
In addition, we go into a bit more detail by splitting the Madrid Central period into two phases: before and after fines enforcement. We clearly see that a low-emission zone only becomes really effective once fines are enforced, i.e. once drivers risk paying if they enter the central district!
We also notice that not only the pollution of the central district, but the pollution of ALL Madrid districts, decreased after the application of fines. Given this observation, we may wonder whether this city-wide drop in air pollution is related to weather conditions rather than to the Madrid Central regulation; after all, rain washes pollutants out of the air. However, according to a scientific study of the weather over that period, the Madrid Central period was not particularly rainy. Thus, the decrease in air pollution does not seem to be weather-related, and Madrid Central appears to have had a positive effect on the whole city!
Besides, we show that ending Madrid Central, lifting the restrictions and allowing vehicles back into the city center, leads to a resurgence of NO2 pollution, with concentrations exceeding the European limit again...
In our exploratory analysis we visualized a static map. But to see whether Madrid Central made a real difference, we have to analyze whether traffic changed significantly between the period before the regulation and the period during it.
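One way to make "significant" concrete, as a hedged sketch rather than something the original pipeline runs, is a two-sample test on the daily mean intensities in Centro before versus during Madrid Central, assuming the df_centro dataframe and the date constants used elsewhere in this notebook:
from scipy import stats
# Daily mean traffic intensity in Centro, before vs. during Madrid Central
before = df_centro.loc[df_centro["date"] < START_DATE, "mean_intensity"]
during = df_centro.loc[(df_centro["date"] >= START_DATE) & (df_centro["date"] < END_DATE), "mean_intensity"]
# Welch's t-test (unequal variances); stats.mannwhitneyu is a non-parametric alternative.
# Note: daily traffic is autocorrelated and seasonal, so the p-value is only indicative.
t_stat, p_value = stats.ttest_ind(before, during, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")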
traffic_points
| idelem | tipo_elem | cod_cent | name | st_x | st_y | latitude | longitude | utm_x | utm_y | district | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1044 | 494 | 03FT08PM01 | 03FT08PM01 | 438963.314635 | 4.474734e+06 | 40.421001 | -3.719488 | -414051.481782 | 4.927311e+06 | Moncloa - Aravaca |
| 1 | 3600 | 494 | PM30901 | PM30901 | 443729.047369 | 4.473268e+06 | 40.408129 | -3.663184 | -407783.811885 | 4.925429e+06 | Moratalaz |
| 2 | 3705 | 494 | PM41451 | PM41451 | 439858.261097 | 4.471574e+06 | 40.392598 | -3.708640 | -412843.963659 | 4.923159e+06 | Carabanchel |
| 3 | 6823 | 494 | PM41453 | PM41453 | 439188.095183 | 4.470895e+06 | 40.386431 | -3.716471 | -413715.710072 | 4.922257e+06 | Carabanchel |
| 4 | 7033 | 495 | 01015 | Pº Castellana S-N - Pl. Colon-Hermosilla | 441569.555897 | 4.475502e+06 | 40.428107 | -3.688839 | -410639.639249 | 4.928350e+06 | Salamanca |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3861 | 6094 | 495 | 47033 | Rotonda Nudo Manoteras - Av. Alcalde Conde May... | 443873.411055 | 4.481898e+06 | 40.485884 | -3.662246 | -407679.323268 | 4.936803e+06 | Hortaleza |
| 3862 | 6999 | 495 | 55065 | Av. Consejo de Europa - Av. Consejo de Europa-... | 447857.576632 | 4.479151e+06 | 40.461390 | -3.615013 | -402421.374536 | 4.933219e+06 | Barajas |
| 3863 | 9939 | 495 | 87029 | (TACTICO) SALIDA BRICOMART | 441776.002171 | 4.468693e+06 | 40.366783 | -3.685782 | -410299.414801 | 4.919386e+06 | Usera |
| 3864 | 5398 | 495 | 94017 | DOM. PARRAGA O-E(P? FERROV-TRAV. LEGAN) | 439308.041175 | 4.466186e+06 | 40.344022 | -3.714610 | -413508.524109 | 4.916061e+06 | Villaverde |
| 3865 | 5399 | 495 | 94018 | DOMIN. PARRAGA E-O(A.PALACIOS-TR. LEGAN) | 439542.738618 | 4.466176e+06 | 40.343945 | -3.711846 | -413200.815293 | 4.916050e+06 | Villaverde |
3866 rows × 11 columns
df_traffic_points = traffic_points.copy()
df_traffic_points = pd.merge(df_traffic_points,
traffic_points_intensity_df.groupby(["idelem", "year", "month"]).agg(intensity=("mean_intensity", "mean")).reset_index(),
on="idelem")
df_traffic_points["date"] = pd.to_datetime(df_traffic_points["year"].astype(str) + "-" + df_traffic_points["month"].astype(str))
df_traffic_points
| idelem | tipo_elem | cod_cent | name | st_x | st_y | latitude | longitude | utm_x | utm_y | district | year | month | intensity | date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1044 | 494 | 03FT08PM01 | 03FT08PM01 | 438963.314635 | 4.474734e+06 | 40.421001 | -3.719488 | -414051.481782 | 4.927311e+06 | Moncloa - Aravaca | 2016 | 1 | 731.360887 | 2016-01-01 |
| 1 | 1044 | 494 | 03FT08PM01 | 03FT08PM01 | 438963.314635 | 4.474734e+06 | 40.421001 | -3.719488 | -414051.481782 | 4.927311e+06 | Moncloa - Aravaca | 2016 | 2 | 774.546446 | 2016-02-01 |
| 2 | 1044 | 494 | 03FT08PM01 | 03FT08PM01 | 438963.314635 | 4.474734e+06 | 40.421001 | -3.719488 | -414051.481782 | 4.927311e+06 | Moncloa - Aravaca | 2016 | 3 | 721.848204 | 2016-03-01 |
| 3 | 1044 | 494 | 03FT08PM01 | 03FT08PM01 | 438963.314635 | 4.474734e+06 | 40.421001 | -3.719488 | -414051.481782 | 4.927311e+06 | Moncloa - Aravaca | 2016 | 4 | 753.728637 | 2016-04-01 |
| 4 | 1044 | 494 | 03FT08PM01 | 03FT08PM01 | 438963.314635 | 4.474734e+06 | 40.421001 | -3.719488 | -414051.481782 | 4.927311e+06 | Moncloa - Aravaca | 2016 | 5 | 737.511780 | 2016-05-01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 185717 | 5399 | 495 | 94018 | DOMIN. PARRAGA E-O(A.PALACIOS-TR. LEGAN) | 439542.738618 | 4.466176e+06 | 40.343945 | -3.711846 | -413200.815293 | 4.916050e+06 | Villaverde | 2019 | 10 | 134.905726 | 2019-10-01 |
| 185718 | 5399 | 495 | 94018 | DOMIN. PARRAGA E-O(A.PALACIOS-TR. LEGAN) | 439542.738618 | 4.466176e+06 | 40.343945 | -3.711846 | -413200.815293 | 4.916050e+06 | Villaverde | 2019 | 11 | 133.696440 | 2019-11-01 |
| 185719 | 5399 | 495 | 94018 | DOMIN. PARRAGA E-O(A.PALACIOS-TR. LEGAN) | 439542.738618 | 4.466176e+06 | 40.343945 | -3.711846 | -413200.815293 | 4.916050e+06 | Villaverde | 2019 | 12 | 121.234405 | 2019-12-01 |
| 185720 | 5399 | 495 | 94018 | DOMIN. PARRAGA E-O(A.PALACIOS-TR. LEGAN) | 439542.738618 | 4.466176e+06 | 40.343945 | -3.711846 | -413200.815293 | 4.916050e+06 | Villaverde | 2020 | 1 | 128.942083 | 2020-01-01 |
| 185721 | 5399 | 495 | 94018 | DOMIN. PARRAGA E-O(A.PALACIOS-TR. LEGAN) | 439542.738618 | 4.466176e+06 | 40.343945 | -3.711846 | -413200.815293 | 4.916050e+06 | Villaverde | 2020 | 2 | 129.349413 | 2020-02-01 |
185722 rows × 15 columns
map = folium.Map(location=[40.420177, -3.703928], zoom_start=12, tiles='cartodb positron')
heatmap_time_data = defaultdict(list)
for _, row in (df_traffic_points.iterrows()):
key = str(row["month"]) + " " + str(row["year"])
heatmap_time_data[key].append([row["latitude"], row["longitude"]])
folium.plugins.HeatMapWithTime(data=list(heatmap_time_data.values()), index=list(heatmap_time_data.keys()) , radius=5, auto_play=True).add_to(map)
map
Ok, this map is not as informative as we thought it would be. We can see small changes in different districts, but nothing too significant. That is why we will try a different visualization, in an attempt to display the changes more clearly.
df_traffic_points_centro = df_traffic_points[df_traffic_points["district"] == "Centro"].reset_index(drop=True)
df_before = df_traffic_points_centro[
(df_traffic_points_centro["date"] < START_DATE)
].groupby(["idelem", "utm_x", "utm_y", "name"]).mean("intensity").reset_index()
df_during = df_traffic_points_centro[
(df_traffic_points_centro["date"] > START_DATE) & (df_traffic_points_centro["date"] < END_DATE)
].groupby(["idelem", "utm_x", "utm_y", "name"]).mean("intensity").reset_index()
df_difference = pd.merge(df_during, df_before[["idelem", "intensity"]], on="idelem", suffixes=["_during", "_before"])
df_difference["intensity_difference"] = df_difference["intensity_during"] - df_difference["intensity_before"]
df_difference
| idelem | utm_x | utm_y | name | tipo_elem | st_x | st_y | latitude | longitude | year | month | intensity_during | intensity_before | intensity_difference | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1043 | -414052.018300 | 4.927195e+06 | 18RT11PM01 | 494.0 | 438962.190268 | 4.474647e+06 | 40.420210 | -3.719493 | 2018.857143 | 4.714286 | 595.987088 | 623.879838 | -27.892750 |
| 1 | 3401 | -412468.892777 | 4.926544e+06 | Concepcion Jerónima O-E - Salvador -Toledo | 495.0 | 440164.727852 | 4.474143e+06 | 40.415759 | -3.705271 | 2018.857143 | 4.714286 | 234.710311 | 237.147019 | -2.436708 |
| 2 | 3410 | -414426.518599 | 4.926423e+06 | Pº ERMITA SANTO S-N | 495.0 | 438671.994078 | 4.474063e+06 | 40.414930 | -3.722857 | 2018.857143 | 4.714286 | 125.955725 | 123.916027 | 2.039698 |
| 3 | 3444 | -413798.068330 | 4.926039e+06 | (TACTICO) Pº Melancólicos S-N - Pº Imperial-Ro... | 495.0 | 439148.595440 | 4.473767e+06 | 40.412298 | -3.717211 | 2018.857143 | 4.714286 | 35.263098 | 31.961285 | 3.301813 |
| 4 | 3475 | -413178.356265 | 4.925754e+06 | Gran Vía de San Francisco de Sales S-N - Puert... | 495.0 | 439619.183579 | 4.473547e+06 | 40.410349 | -3.711644 | 2018.857143 | 4.714286 | 608.253863 | 643.755345 | -35.501482 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 157 | 9970 | -413658.824592 | 4.925645e+06 | Pº Imperial N-S - Jemenuño -Gil Imón | 495.0 | 439252.303777 | 4.473467e+06 | 40.409607 | -3.715960 | 2018.857143 | 4.714286 | 137.526724 | 119.348245 | 18.178479 |
| 158 | 9971 | -413646.155461 | 4.925647e+06 | Pº Imperial S-N - Gil Imón -Jemenuño | 495.0 | 439261.974395 | 4.473469e+06 | 40.409623 | -3.715847 | 2018.857143 | 4.714286 | 131.399655 | 125.555571 | 5.844083 |
| 159 | 9972 | -413534.466929 | 4.925579e+06 | Gil Imón E-O - Villoslada -Pº Imperial | 495.0 | 439346.686566 | 4.473417e+06 | 40.409158 | -3.714843 | 2018.857143 | 4.714286 | 56.917851 | 57.077523 | -0.159673 |
| 160 | 10034 | -410853.739846 | 4.927927e+06 | Barbara Braganza E-O - Pº Recoletos - Marques... | 495.0 | 441403.900298 | 4.475182e+06 | 40.425213 | -3.690762 | 2018.857143 | 4.714286 | 174.036159 | 227.919912 | -53.883753 |
| 161 | 10036 | -411063.962135 | 4.927983e+06 | Barbara Braganza O-E - Conde Xiquena-Tamayo y... | 495.0 | 441244.040464 | 4.475227e+06 | 40.425599 | -3.692650 | 2018.857143 | 4.714286 | 53.440589 | 56.407000 | -2.966411 |
162 rows × 14 columns
p_before = figure(title="Average traffic intensity per station BEFORE Madrid Central", x_axis_type="mercator", y_axis_type="mercator",
height=1000, width=1000, tools="")
p_before.axis.visible = False
p_before.toolbar.logo = None
p_before.toolbar_location = None
# Districts
name = "Centro"
source = ColumnDataSource(df_districts[df_districts["name"] == "Centro"])
p_before.line(x="utm_x", y="utm_y", color=MADRID_IN_OUT_COLORS[0], line_width=4,
source=source, legend_label="Madrid Central limit",)
source = ColumnDataSource(df_before)
circles = p_before.circle(x="utm_x", y="utm_y", line_width=1, color=linear_cmap("intensity", "Viridis256", 0, 1000),
source=source, size=10)
cartodb = get_provider(CARTODBPOSITRON)
p_before.add_tile(cartodb)
TOOLTIPS = [
("Measurement station Name", "@name"),
("Average Intensity", "@intensity"),
]
p_before.add_tools(HoverTool(tooltips=TOOLTIPS, renderers=[circles]))
p_before.background_fill_color = None
p_before.border_fill_color = None
mapper = linear_cmap(field_name='intensity', palette="Viridis256", low=0, high=1000)
color_bar = ColorBar(color_mapper=mapper['transform'], width=8)
# DURING
p_during = figure(title="Average traffic intensity per station DURING Madrid Central", x_axis_type="mercator", y_axis_type="mercator",
height=1000, width=1000, tools="")
p_during.axis.visible = False
p_during.toolbar.logo = None
p_during.toolbar_location = None
# Districts
name = "Centro"
source = ColumnDataSource(df_districts[df_districts["name"] == "Centro"])
p_during.line(x="utm_x", y="utm_y", color=MADRID_IN_OUT_COLORS[0], line_width=4,
source=source, legend_label="Madrid Central limit",)
source = ColumnDataSource(df_during)
circles = p_during.circle(x="utm_x", y="utm_y", line_width=1, color=linear_cmap("intensity", "Viridis256", 0, 1000),
source=source, size=10)
cartodb = get_provider(CARTODBPOSITRON)
p_during.add_tile(cartodb)
TOOLTIPS = [
("Measurement station Name", "@name"),
("Average Intensity", "@intensity"),
]
p_during.add_tools(HoverTool(tooltips=TOOLTIPS, renderers=[circles]))
p_during.background_fill_color = None
p_during.border_fill_color = None
mapper = linear_cmap(field_name='intensity', palette="Viridis256", low=0, high=1000)
color_bar = ColorBar(color_mapper=mapper['transform'], width=8)
p_during.add_layout(color_bar, 'right')
p = layout([[p_before, p_during]], sizing_mode='scale_both')
if SAVE_PLOTS:
output_file("html_plots/centro_traffic_points.html")
save(p)
reset_output()
output_notebook()
else:
show(p)
It seems that the measure was effective. We can appreciate some changes, especially in the area of Gran Vía (the long road that crosses the north of the district from east to west). To better display these changes, we plot the difference between the periods before and during Madrid Central, which shows the outcome more clearly. Just from this data, it looks like the measure was effective and reduced the number of vehicles in the city center.
But we do not know yet whether this is a real difference, or whether there was simply less traffic in all of Madrid during this period. We need further analysis to determine that.
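A simple way to check this, sketched here under the assumption that total_traffic_df carries the district, date and mean_intensity columns used elsewhere in this notebook, is to compare the relative before-to-during change inside Centro with the same change in the rest of the city:
def relative_change(df):
    # mean daily intensity before vs. during Madrid Central
    before = df.loc[df["date"] < START_DATE, "mean_intensity"].mean()
    during = df.loc[(df["date"] >= START_DATE) & (df["date"] < END_DATE), "mean_intensity"].mean()
    return 100 * (during - before) / before
centro = total_traffic_df[total_traffic_df["district"] == "Centro"]
rest = total_traffic_df[total_traffic_df["district"] != "Centro"]
print(f"Centro:         {relative_change(centro):+.1f}%")
print(f"Rest of Madrid: {relative_change(rest):+.1f}%")
If Centro drops noticeably more than the rest of the city, the change cannot be explained by a city-wide traffic decline alone.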
p_diff = figure(title="Difference of traffic intensities per station before and during Madrid Central", x_axis_type="mercator", y_axis_type="mercator",
height=1000, width=1000, tools="")
p_diff.toolbar.logo = None
p_diff.toolbar_location = None
# Districts
name = "Centro"
source = ColumnDataSource(df_districts[df_districts["name"] == "Centro"])
p_diff.line(x="utm_x", y="utm_y", color=MADRID_IN_OUT_COLORS[0], line_width=4,
source=source, legend_label="Madrid Central limit",)
source = ColumnDataSource(df_difference)
circles = p_diff.circle(x="utm_x", y="utm_y", line_width=1, color=linear_cmap("intensity_difference", "Plasma256", -300, 300),
source=source, size=10)
cartodb = get_provider(CARTODBPOSITRON)
p_diff.add_tile(cartodb)
TOOLTIPS = [
("Measurement station Name", "@name"),
("Average Intensity", "@intensity_difference"),
]
p_diff.add_tools(HoverTool(tooltips=TOOLTIPS, renderers=[circles]))
p_diff.background_fill_color = None
p_diff.border_fill_color = None
mapper = linear_cmap(field_name='intensity', palette="Plasma256", low=-300, high=300)
color_bar = ColorBar(color_mapper=mapper['transform'], width=8)
p_diff.add_layout(color_bar, 'right')
p = layout([[p_diff]], sizing_mode='scale_both')
if SAVE_PLOTS:
output_file("html_plots/centro_traffic_points_diff.html")
save(p)
reset_output()
output_notebook()
else:
show(p)
To compare Madrid Central with the whole city of Madrid, we will do a weekly analysis of the traffic, to see whether the trends before and during Madrid Central match the ones we saw in the exploratory analysis above.
# compute data before Madrid Central inauguration, INSIDE Centro
df_centro_week_before = df_centro[df_centro["date"] < START_DATE].groupby("day_of_week").agg(mean_intensity_week = ("mean_intensity", "mean")).reset_index()
df_centro_week_before["day_of_week"] = df_centro_week_before["day_of_week"].astype("category")
df_centro_week_before["day_of_week"] = df_centro_week_before["day_of_week"].cat.set_categories(week_days)
df_centro_week_before = df_centro_week_before.sort_values("day_of_week")
# compute data before Madrid Central inauguration, OUTSIDE Centro
df_week_before = total_traffic_df[total_traffic_df["date"] < START_DATE].groupby("day_of_week").agg(mean_intensity_week = ("mean_intensity", "mean")).reset_index()
df_week_before["day_of_week"] = df_week_before["day_of_week"].astype("category")
df_week_before["day_of_week"] = df_week_before["day_of_week"].cat.set_categories(week_days)
df_week_before = df_week_before.sort_values("day_of_week")
# get the difference between INSIDE and OUTSIDE Centro, before the MC inauguration
week_diff_before = []
for day in week_days:
inside_mc = df_centro_week_before[df_centro_week_before["day_of_week"] == day]["mean_intensity_week"].values[0]
outside_mc = df_week_before[df_week_before["day_of_week"] == day]["mean_intensity_week"].values[0]
week_diff_before.append([day, outside_mc-inside_mc])
df_week_diff_before = pd.DataFrame(week_diff_before, columns=["day_of_week", "difference_of_mean_intensity"])
# WEEK BAR PLOT
p_week = figure(title="Average traffic intensity per day in a week IN and OUT, \nBEFORE Madrid Central",
x_axis_label="Weekday", y_axis_label="Traffic intensity average by day",
width=400, height=400, x_range=week_days, tools="")
p_week.toolbar.logo = None
p_week.toolbar_location = None
p_week.border_fill_color = None
# WEEK DIFF LOLLIPOP PLOT
x_lim = 70
p_diff = figure(title="Difference of averages of traffic intensity IN and OUT, \nBEFORE Madrid Central",
y_axis_label="Weekday", x_axis_label="Difference in traffic",
width=400, height=400, y_range=week_days[::-1], x_range=(-x_lim, x_lim), tools="")
p_diff.toolbar.logo = None
p_diff.toolbar_location = None
p_diff.border_fill_color = None
circles = []
for i, week_day in enumerate(week_days):
source = ColumnDataSource(df_centro_week_before[df_centro_week_before["day_of_week"] == week_day])
p_week.vbar(x="day_of_week", top="mean_intensity_week", source=source, width=0.5,
legend_label="Madrid Central", alpha=0.2,
line_alpha=1, line_color=WEEK_DARK_COLORS[i],
color=WEEK_COLORS[i])
source = ColumnDataSource(df_week_diff_before[df_week_diff_before["day_of_week"] == week_day])
p_diff.hbar(y="day_of_week", right="difference_of_mean_intensity", source=source,
height=0.05, color=WEEK_COLORS[i])
circle = p_diff.circle(x="difference_of_mean_intensity", y="day_of_week", source=source, size=15,
color=WEEK_COLORS[i])
circles.append(circle)
source = ColumnDataSource(df_week_before)  # city-wide weekly average for the BEFORE period
p_week.line(x="day_of_week", y="mean_intensity_week", source=source,
line_width=3, legend_label="OUT of Madrid Central", color=WEEK_COLORS[7])
# Hover tooltip
TOOLTIPS = [
("Average Intensity per day", "@mean_intensity_week"),
("Day of the Week", "@day_of_week")
]
p_week.add_tools(HoverTool(tooltips=TOOLTIPS, mode="vline"))
p_week.add_layout(p_week.legend[0], "below")
TOOLTIPS = [
("Difference of intensity IN and OUT of Madrid Central", "@difference_of_mean_intensity{00.00}"),
("Day of the Week", "@day_of_week")]
p_diff.add_tools(HoverTool(tooltips=TOOLTIPS, mode="hline", renderers=circles))
p = layout([p_week, p_diff], sizing_mode='stretch_both')
if SAVE_PLOTS:
output_file("html_plots/week_analysis_before.html")
save(p)
reset_output()
output_notebook()
else:
show(p)
# compute data during Madrid Central, INSIDE Centro
df_centro_week_after = df_centro[(df_centro["date"] >= START_DATE) & (df_centro["date"] < END_DATE)].groupby("day_of_week").agg(mean_intensity_week = ("mean_intensity", "mean")).reset_index()
df_centro_week_after["day_of_week"] = df_centro_week_after["day_of_week"].astype("category")
df_centro_week_after["day_of_week"] = df_centro_week_after["day_of_week"].cat.set_categories(week_days)
df_centro_week_after = df_centro_week_after.sort_values("day_of_week")
# compute data during Madrid Central, OUTSIDE Centro
df_week_after = total_traffic_df[(total_traffic_df["date"] >= START_DATE) & (total_traffic_df["date"] < END_DATE)].groupby("day_of_week").agg(mean_intensity_week = ("mean_intensity", "mean")).reset_index()
df_week_after["day_of_week"] = df_week_after["day_of_week"].astype("category")
df_week_after["day_of_week"] = df_week_after["day_of_week"].cat.set_categories(week_days)
df_week_after = df_week_after.sort_values("day_of_week")
# get the difference between INSIDE and OUTSIDE Centro, during MC
week_diff_after = []
for day in week_days:
inside_mc = df_centro_week_after[df_centro_week_after["day_of_week"] == day]["mean_intensity_week"].values[0]
outside_mc = df_week_after[df_week_after["day_of_week"] == day]["mean_intensity_week"].values[0]
week_diff_after.append([day, outside_mc-inside_mc])
df_week_diff_after = pd.DataFrame(week_diff_after, columns=["day_of_week", "difference_of_mean_intensity"])
# WEEK BAR PLOT
p_week = figure(title="Average traffic intensity per day in a week IN and OUT, \nDURING Madrid Central",
x_axis_label="Weekday", y_axis_label="Traffic intensity average by day",
width=400, height=400, x_range=week_days, tools="")
p_week.toolbar.logo = None
p_week.toolbar_location = None
p_week.border_fill_color = None
# WEEK DIFF LOLLIPOP PLOT
x_lim = 70
p_diff = figure(title="Difference of averages of traffic intensity IN and OUT, \nDURING Madrid Central",
y_axis_label="Weekday", x_axis_label="Difference in traffic",
width=400, height=400, y_range=week_days[::-1], x_range=(-x_lim, x_lim), tools="")
p_diff.toolbar.logo = None
p_diff.toolbar_location = None
p_diff.border_fill_color = None
circles = []
for i, week_day in enumerate(week_days):
source = ColumnDataSource(df_centro_week_after[df_centro_week_after["day_of_week"] == week_day])
p_week.vbar(x="day_of_week", top="mean_intensity_week", source=source, width=0.5,
legend_label="Madrid Central", alpha=0.2,
line_alpha=1, line_color=WEEK_DARK_COLORS[i],
color=WEEK_COLORS[i])
source = ColumnDataSource(df_week_diff_after[df_week_diff_after["day_of_week"] == week_day])
p_diff.hbar(y="day_of_week", right="difference_of_mean_intensity", source=source,
height=0.05, color=WEEK_COLORS[i])
circle = p_diff.circle(x="difference_of_mean_intensity", y="day_of_week", source=source, size=15,
color=WEEK_COLORS[i])
circles.append(circle)
source = ColumnDataSource(df_week_after)  # city-wide weekly average for the DURING period
p_week.line(x="day_of_week", y="mean_intensity_week", source=source,
line_width=3, legend_label="OUT of Madrid Central", color=WEEK_COLORS[7])
# Hover tooltip
TOOLTIPS = [
("Average Intensity per day", "@mean_intensity_week"),
("Day of the Week", "@day_of_week")
]
p_week.add_tools(HoverTool(tooltips=TOOLTIPS, mode="vline"))
p_week.add_layout(p_week.legend[0], "below")
TOOLTIPS = [
("Difference of intensity IN and OUT of Madrid Central", "@difference_of_mean_intensity{00.00}"),
("Day of the Week", "@day_of_week")
]
p_diff.add_tools(HoverTool(tooltips=TOOLTIPS, mode="hline", renderers=circles))
p = layout([p_week, p_diff], sizing_mode='stretch_both')
if SAVE_PLOTS:
output_file("html_plots/week_analysis_during.html")
save(p)
reset_output()
output_notebook()
else:
show(p)
This seems promising: it looks like Madrid Central was indeed effective in reducing traffic inside the city center. We may still be falling into some biases, though. For example, since we are working with averages, we are not sure whether this decrease simply reflects a season with low vehicle activity.
To check this, we have to display all the data we have in a way that shows whether the improvement is due to Madrid Central being effective or to some other reason.
df_month_districts = total_traffic_df.groupby(["district", "year", "month"])["mean_intensity"].mean().reset_index()
df_month_districts["month_name"] = df_month_districts["month"].apply(lambda x: months_name[x-1])
df_month_districts["date"] = pd.to_datetime(df_month_districts["month_name"] + "-" + df_month_districts["year"].astype(str))
df_month_districts_centro = df_month_districts[df_month_districts["district"] == "Centro"].reset_index(drop=True)
df_month_districts_not_centro = df_month_districts[df_month_districts["district"] != "Centro"].reset_index(drop=True)
df_month_districts_not_centro = df_month_districts_not_centro.groupby(["year", "month"]).mean("mean_intensity").reset_index()
df_month_districts_centro["mean_intensity"] = df_month_districts_centro["mean_intensity"] - df_month_districts_not_centro["mean_intensity"]
p = figure(x_range=months_name, title="Average traffic intensity per month in a year in Madrid Central, from 2016 to February 2020", x_axis_label="Month",
y_axis_label="Traffic intensity by month")
for i, year in enumerate(df_month_districts["year"].unique()):
source = ColumnDataSource(df_month_districts[(df_month_districts["district"] == "Centro") & (df_month_districts["year"] == year)])
p.line(x="month_name", y="mean_intensity", source=source, legend_label=str(year), color=DISTRICT_COLORS[i], line_width=2)
p.circle(x="month_name", y="mean_intensity", source=source, legend_label=str(year), color=DISTRICT_COLORS[i], size=10)
p.border_fill_color = None
TOOLTIPS = [
("Average Intensity per month", "@mean_intensity"),
("Month", "@month_name"),
("Year", "@year")
]
p.add_tools(HoverTool(tooltips=TOOLTIPS))
p.add_layout(p.legend[0], "right")
p.legend.click_policy = "hide"
p = layout(p, sizing_mode='stretch_both')
if SAVE_PLOTS:
output_file("html_plots/month_analysis.html")
save(p)
reset_output()
output_notebook()
else:
show(p)
Now, analyzing the traffic behavior during the different months of the year inside the Madrid Central area, we spot that in summer (June, July, August) the traffic intensity decreases a lot, especially in August. This is something we already expected, as is the small decrease around Christmas (December, January). During holiday periods, people tend to leave the city center.
We can also see that 2018 and 2019 have a lower average traffic intensity per month than 2016 and 2017.
p = figure(x_range=months_name, title="Difference of averages of traffic intensity IN and OUT of Madrid Central, from 2016 to February 2020", x_axis_label="Month",
y_axis_label="Difference in traffic")
p.border_fill_color = None
for i, year in enumerate(df_month_districts_centro["year"].unique()):
source = ColumnDataSource(df_month_districts_centro[df_month_districts_centro["year"] == year])
p.line(x="month_name", y="mean_intensity", source=source, legend_label=str(year), color=DISTRICT_COLORS[i], line_width=2)
p.circle(x="month_name", y="mean_intensity", source=source, legend_label=str(year), color=DISTRICT_COLORS[i], size=10)
TOOLTIPS = [
("Difference of intensity IN and OUT of Madrid Central", "@mean_intensity"),
("Month", "@month_name"),
("Year", "@year")
]
p.add_tools(HoverTool(tooltips=TOOLTIPS))
p.add_layout(p.legend[0], "right")
zero_line_span = Span(location=0, dimension='width', line_color='red', line_width=1, line_alpha=0.3)
p.add_layout(zero_line_span)
p.legend.click_policy = "hide"
p = layout(p, sizing_mode='stretch_both')
if SAVE_PLOTS:
output_file("html_plots/month_difference_analysis.html")
save(p)
reset_output()
output_notebook()
else:
show(p)
If we calculate the difference between the average traffic in the Madrid Central area and in the other districts, we discover that in 2018 and 2019 this difference is smaller. Once again, as expected, summer traffic is more intense in the city center than in the other districts. We can also see a large difference at the beginning of 2019, when the measure was in place, which may once again indicate its effectiveness.
df_month_districts_outside = df_month_districts[df_month_districts["district"] != "Centro"].groupby(["date", "month_name"]).mean("mean_intensity").reset_index()
p = figure(x_axis_type='datetime', title="Average traffic intensity per month IN and OUT of Madrid Central, from 2016 to February 2020", x_axis_label="Year",
y_axis_label="Average traffic intensity by month")
p.border_fill_color = None
source = ColumnDataSource(df_month_districts[df_month_districts["district"] == "Centro"])
p.line(x="date", y="mean_intensity", source=source, legend_label="IN Madrid Central", color=MADRID_IN_OUT_COLORS[0], line_width=2)
p.circle(x="date", y="mean_intensity", source=source, legend_label="IN Madrid Central", color=MADRID_IN_OUT_COLORS[0], size=10)
source = ColumnDataSource(df_month_districts_outside)
p.line(x="date", y="mean_intensity", source=source, legend_label="OUT of Madrid Central", color=MADRID_IN_OUT_COLORS[1], line_width=2)
p.circle(x="date", y="mean_intensity", source=source, legend_label="OUT of Madrid Central", color=MADRID_IN_OUT_COLORS[1], size=10)
TOOLTIPS = [
("Average intensity per month", "@mean_intensity"),
("Month", "@month_name"),
("Year", "@year")
]
p.add_tools(HoverTool(tooltips=TOOLTIPS))
# p.add_layout(p.legend[0], "right")
p.legend.click_policy = "hide"
# create annotations for time marks
startMC_span = Span(location=START_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
finesMC_span = Span(location=FINES_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
endMC_span = Span(location=END_MC,
dimension='height', line_color='black',
line_dash='dashed', line_width=2, line_alpha=0.3)
# add annotations to plot
p.add_layout(startMC_span)
p.add_layout(finesMC_span)
p.add_layout(endMC_span)
p = layout(p, sizing_mode='stretch_both')
if SAVE_PLOTS:
output_file("html_plots/year_in_out_analysis.html")
save(p)
reset_output()
output_notebook()
else:
show(p)
Now that we have analyzed air quality and traffic data on their own, it is important to find the relation (if there is one) between them. The most obvious first step is to compute the correlation matrix.
df_air["date"] = pd.to_datetime(df_air["datetime"].dt.strftime('%Y-%m-%d'))
no2_df = df_air[df_air["formula"] == "NO2"].groupby("date").mean().reset_index()[["date", "value"]]
no2_df = no2_df.rename(columns={"value": "NO2"})
so2_df = df_air[df_air["formula"] == "SO2"].groupby("date").mean().reset_index()[["date", "value"]]
so2_df = so2_df.rename(columns={"value": "SO2"})
co_df = df_air[df_air["formula"] == "CO"].groupby("date").mean().reset_index()[["date", "value"]]
co_df = co_df.rename(columns={"value": "CO"})
o3_df = df_air[df_air["formula"] == "O3"].groupby("date").mean().reset_index()[["date", "value"]]
o3_df = o3_df.rename(columns={"value": "O3"})
tr_df = traffic_points_intensity_df.groupby("date").mean().reset_index()[["date", "mean_intensity"]]
tr_df = tr_df.rename(columns={"mean_intensity": "traffic"})
df_air_traffic = pd.merge(no2_df, so2_df, on="date", how="inner")
df_air_traffic = pd.merge(df_air_traffic, co_df, on="date", how="inner")
df_air_traffic = pd.merge(df_air_traffic, o3_df, on="date", how="inner")
df_air_traffic = pd.merge(df_air_traffic, tr_df, on="date", how="inner")
df_air_traffic
| date | NO2 | SO2 | CO | O3 | traffic | |
|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | 34.980072 | 8.121739 | 0.336087 | 26.520217 | 278.539462 |
| 1 | 2016-01-02 | 25.631944 | 7.562500 | 0.290417 | 46.012262 | 310.490247 |
| 2 | 2016-01-03 | 31.251736 | 7.329167 | 0.310000 | 32.623065 | 278.447363 |
| 3 | 2016-01-04 | 18.300347 | 7.579167 | 0.223333 | 40.120446 | 367.332237 |
| 4 | 2016-01-05 | 27.770833 | 7.862500 | 0.282500 | 57.022768 | 360.633167 |
| ... | ... | ... | ... | ... | ... | ... |
| 1514 | 2020-02-25 | 54.059028 | 7.795833 | 0.467917 | 30.054851 | 491.780896 |
| 1515 | 2020-02-26 | 21.821181 | 6.237500 | 0.229583 | 69.188631 | 503.578301 |
| 1516 | 2020-02-27 | 25.453125 | 6.170833 | 0.241250 | 58.634613 | 513.042999 |
| 1517 | 2020-02-28 | 51.290435 | 7.802521 | 0.393277 | 33.923482 | 497.364935 |
| 1518 | 2020-02-29 | 24.945848 | 6.279817 | 0.311927 | 48.424315 | 365.071673 |
1519 rows × 6 columns
no2_df = df_air[(df_air["formula"] == "NO2") & (df_air["name"] == "Plaza del Carmen")].groupby("date").mean().reset_index()[["date", "value"]]
no2_df = no2_df.rename(columns={"value": "NO2"})
so2_df = df_air[(df_air["formula"] == "SO2") & (df_air["name"] == "Plaza del Carmen")].groupby("date").mean().reset_index()[["date", "value"]]
so2_df = so2_df.rename(columns={"value": "SO2"})
co_df = df_air[(df_air["formula"] == "CO") & (df_air["name"] == "Plaza del Carmen")].groupby("date").mean().reset_index()[["date", "value"]]
co_df = co_df.rename(columns={"value": "CO"})
o3_df = df_air[(df_air["formula"] == "O3") & (df_air["name"] == "Plaza del Carmen")].groupby("date").mean().reset_index()[["date", "value"]]
o3_df = o3_df.rename(columns={"value": "O3"})
tr_df = traffic_points_intensity_df[traffic_points_intensity_df["district"] == "Centro"].groupby("date").mean().reset_index()[["date", "mean_intensity"]]
tr_df = tr_df.rename(columns={"mean_intensity": "traffic"})
df_air_traffic_madrid = pd.merge(no2_df, so2_df, on="date", how="inner")
df_air_traffic_madrid = pd.merge(df_air_traffic_madrid, co_df, on="date", how="inner")
df_air_traffic_madrid = pd.merge(df_air_traffic_madrid, o3_df, on="date", how="inner")
df_air_traffic_madrid = pd.merge(df_air_traffic_madrid, tr_df, on="date", how="inner")
df_air_traffic_madrid
| date | NO2 | SO2 | CO | O3 | traffic | |
|---|---|---|---|---|---|---|
| 0 | 2016-01-01 | 39.652174 | 5.652174 | 0.386957 | 21.873913 | 309.436629 |
| 1 | 2016-01-02 | 32.041667 | 3.000000 | 0.245833 | 35.020833 | 361.187308 |
| 2 | 2016-01-03 | 35.666667 | 3.333333 | 0.341667 | 27.875417 | 373.553736 |
| 3 | 2016-01-04 | 22.791667 | 6.583333 | 0.275000 | 31.146250 | 421.640004 |
| 4 | 2016-01-05 | 35.833333 | 7.875000 | 0.258333 | 43.290417 | 388.572838 |
| ... | ... | ... | ... | ... | ... | ... |
| 1492 | 2020-02-25 | 61.541667 | 8.791667 | 0.720833 | 29.670833 | 405.391449 |
| 1493 | 2020-02-26 | 28.750000 | 9.750000 | 0.170833 | 90.474167 | 417.048690 |
| 1494 | 2020-02-27 | 27.791667 | 9.708333 | 0.291667 | 82.999583 | 431.823006 |
| 1495 | 2020-02-28 | 55.500000 | 11.458333 | 0.450000 | 48.650417 | 432.899723 |
| 1496 | 2020-02-29 | 26.958333 | 9.083333 | 0.458333 | 70.207083 | 362.165472 |
1497 rows × 6 columns
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(13, 11))
fig.suptitle("Correlation of the different variables")
ax1.set_title("Correlation in the whole city of Madrid")
ax2.set_title("Correlation IN Madrid Central")
ax3.set_title("Correlation IN Madrid Central BEFORE the measures")
ax4.set_title("Correlation IN Madrid Central DURING the measures")
sns.heatmap(df_air_traffic.corr(), cmap="viridis", vmax=1, vmin=-1, annot=True, fmt=".2f", ax=ax1)
sns.heatmap(df_air_traffic_madrid.corr(), cmap="viridis", vmax=1, vmin=-1, annot=True, fmt=".2f", ax=ax2)
sns.heatmap(df_air_traffic_madrid[df_air_traffic_madrid["date"] < START_DATE].corr(), cmap="viridis", vmax=1, vmin=-1, annot=True, fmt=".2f", ax=ax3)
sns.heatmap(df_air_traffic_madrid[(df_air_traffic_madrid["date"] >= START_DATE) & (df_air_traffic_madrid["date"] < END_DATE)].corr(), cmap="viridis", vmax=1, vmin=-1, annot=True, fmt=".2f", ax=ax4)
plt.tight_layout()
plt.savefig("html_plots/correlation_day")
plt.show()
CONCLUSIONS
In the light of this Madrid Central case study, we investigated whether implementing vehicle regulations in dense urban areas contributes to reducing pollution.
To recall our primary question: WAS MADRID CENTRAL AS EFFECTIVE AS EXPECTED? The answer is YES!
Considering both air quality and traffic data, we show that Madrid Central was effective. The NO2 concentration decreased by 18.6%, especially after the enforcement of fines for drivers. In this way, Madrid ended up below the European NO2 standard, as required. Traffic also decreased during the implementation period of Madrid Central, and it even fell below the traffic observed outside the low-emission area. Nevertheless, based on the correlation matrices we computed, air quality and traffic do not seem strongly correlated. Even though they are positively correlated, the correlation coefficients are lower than we intuitively expected. Part of the explanation may be that air pollution depends on many more factors than traffic: it is also related to urban life, surrounding buildings, the weather or people's behavior, for instance.
Finally, as you know, the Madrid Central regulation has been lifted due to a political reversal... We show in our study that allowing cars back into the city center led to a return to the original pollution situation. The current Council of Madrid is now transitioning to its new environmental plan, called MADRID 360.
MADRID 360 is more ambitious in scope, but it is staged over the long term and is far less restrictive than Madrid Central. It allows more vehicles inside the city center, provides more parking spots, etc. It will expand the restricted zone year by year in an attempt to reduce traffic further, and hopefully it will succeed.
In summary, MADRID 360 has fewer restrictions on vehicles, but covers a much greater area.
Stay tuned, and let's hope this new plan will be effective enough, for the sake of the health of our dear Earth!
Our data story is presented as a Partitioned Poster, including as many user interactions as possible.
Referring to Segel and Heer's concepts, we use a mixture of author-driven and reader-driven stories.
Indeed, we provide a "linear ordering of scenes" and we have "heavy messaging" throughout the webpage to explain our findings, which relates to author-driven stories. Nonetheless, we offer the user the possibility of free navigation, and each visualisation allows "free interactivity", which makes our story more reader-driven.

[Definition] Visual structuring refers to "mechanisms that communicate the overall structure of the narrative to the viewer and allow him to identify his position within the larger organization of the visualization" (Segel and Heer).
[Our application] We use a Consistent Visual Platform, as we tell our story through a webpage that induces an implicit hierarchy of the presented sections by following the reading direction.

[Definition] Highlighting refers to "visual mechanisms that help direct the viewer’s attention to particular elements in the display" (Segel and Heer).
[Our application] We use Feature Distinction, as we divide our story into clearly separated sections (different background colors, distinguishable titles...).

[Definition] Transition guidance concerns "techniques for moving within or between visual scenes without disorienting the viewer" (Segel and Heer).
[Our application] We use Object Continuity, as we follow the same template design for all of our sections: colors follow the same global theme and we use the same range of fonts. This way, the user understands that he/she is jumping to another section without getting lost.
[Definition] Ordering refers to "the ways of arranging the path viewers take through the visualization" (Segel and Heer).
[Our application] We use a User Directed Path, since "the user must select a path among multiple alternatives" to navigate our poster. Despite the implicit reading direction, the user is free to jump around and consult whichever section he/she wants, thanks to the clickable table of contents. Additionally, as some of our visualizations have multiple tabs, the user is not required to follow any specific order among those tabs.

[Definition] Interactivity refers to "the different ways a user can manipulate the visualization" (Segel and Heer).
[Our application] We use both Hover Highlighting/Details and Filtering/Selection/Search, as we allow hovering on our plots to display additional information the user may be interested in. In most of the plots we also provide filtering and selection, via clickable legend items, to show only what is relevant at a given moment. Furthermore, one of the interactive maps also uses Navigation Buttons to move through the months of the plot. A minimal sketch of these two interaction mechanisms is shown below.
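As a minimal, self-contained Bokeh sketch of hover tooltips and legend-click filtering (the data and column names below are purely illustrative and are not taken from our datasets):

from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
import pandas as pd

# Illustrative daily NO2 values inside/outside Madrid Central (made-up numbers)
df = pd.DataFrame({
    "date": pd.date_range("2018-11-01", periods=5, freq="D"),
    "inside": [52, 48, 45, 44, 41],
    "outside": [55, 54, 53, 55, 52],
})
source = ColumnDataSource(df)

p = figure(x_axis_type="datetime", title="Daily NO2 concentration (illustrative data)")

# One renderer per zone, each with its own legend entry
p.line("date", "inside", source=source, color="green", legend_label="Inside Madrid Central")
p.line("date", "outside", source=source, color="red", legend_label="Outside Madrid Central")

# Hover Highlighting/Details: show the exact values under the cursor
p.add_tools(HoverTool(
    tooltips=[("Date", "@date{%F}"), ("Inside", "@inside"), ("Outside", "@outside")],
    formatters={"@date": "datetime"},
))

# Filtering/Selection: clicking a legend item hides the corresponding line
p.legend.click_policy = "hide"

show(p)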

[Definition] Messaging refers to "the ways a visualization communicates observations and commentary to the viewer" (Segel and Heer).
[Our application] We use Captions/Headlines in all our plots, to make clear to the user what he/she is looking at. For each section, we use Introductory Text to provide context to the user, and Summary/Synthesis for each plot to give our interpretation of the results.

For our story, we decided to separate the different parts into sections.
In the Introduction, along with the title, we present the problem, the city of Madrid, how it is divided into districts, and a summarized version of the complete story of Madrid Central. We use a map so the user becomes familiar with the shape of Madrid and can start to explore where each district is, with an interactive hover over the map. The story of Madrid Central is initially hidden, in case a user is not interested in reading it, and can be expanded by clicking on each image.

Right after, we show our findings in a section called Focus on Data, which is divided into Air Quality, Traffic, and Relation between air quality and traffic. In this section we display all our relevant plots and findings.
Inside Air Quality, we start by showing where the air quality measurement stations are located, so the user can get acquainted with the topic and with what is going to be discussed in the section. Below it comes the comparison of gases with the previous year. Using different tabs, this allows the user to see how gas concentrations evolved compared to the same date one year earlier, showing the changes in Madrid Central, Outside Madrid Central, Close to Madrid Central, and Far from Madrid Central.
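As a schematic illustration of how such tabbed views can be assembled with Bokeh's Panel/Tabs (Bokeh 2.x API; the plotting code itself is only a placeholder, not our actual figure-building code):

from bokeh.models import Panel, Tabs
from bokeh.plotting import figure, show

# Zone categories matching the ones described above
zones = ["Madrid Central", "Outside Madrid Central",
         "Close to Madrid Central", "Far from Madrid Central"]

panels = []
for zone in zones:
    p = figure(x_axis_type="datetime", title=f"Gas evolution vs. previous year - {zone}")
    # ... here each gas would be drawn for this zone, e.g. with p.line(...) ...
    panels.append(Panel(child=p, title=zone))

# One tab per zone; the user can inspect them in any order
show(Tabs(tabs=panels))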


After that, we relate our results to the European air quality standards with a short GIF, where the user can stop and move frames if desired, displaying the evolution of $NO_2$ with sizes and colors, so it is more user-friendly and approachable.

Using the same categories as those displayed on the map, we plot average concentrations at different locations over the course of the Madrid Central process. These bar plots allow us to investigate the evolution and change more clearly, leading to our results and concluding our air quality analysis.

Now we get to the Traffic analysis, where we give the user an overview, showing a map with the location of all the measurement stations.

After that, we focus on the area of Madrid Central, plotting the traffic intensity at each point before and during the measures, along with the difference, to show the changes these measures brought about.

That visualization informs us of the evolution inside Madrid Central, but, to really assess its efficacy, we need to compare it with the overall traffic in the city of Madrid. For that, we use the following visualization, where we compare the weekly evolution before and during the measures, using data from both inside and outside Madrid Central. We use the same color for each day of the week to make the plot easier to understand from the user's perspective.

Analyzing averages alone is not enough to tell whether the difference is due to outliers or because the measures really worked. To show this to the user, we plot a comparison of the different years against the average outside Madrid Central, thereby showing a more accurate evolution over the timeline.

To finish with the traffic analysis, we display the full timeline, highlighting key dates for the user and using those dates as an aid to understand the changes and to show whether the measures really worked.

The last section where we use plots is the Relation between air quality and traffic, where we relate air quality to traffic. We do not find any strong direct relationship between the two, and we show this to the user with correlation matrices. We use several of them to show that the relationship is weak in Madrid Central both before and during the measures, and in the city of Madrid as a whole.

What went well?
We found useful data to investigate our idea and carry out our project. We could then explore, analyze, and visualize the different datasets we found, dealing with a large amount of information. Not only was the amount of data huge for some datasets, but the types of data were also very different: we had to change coordinate systems, transform measurements recorded every 15 minutes over almost 5 years into a more manageable daily form that could be used for visualizations and results, and work with distinct measurements of gas concentrations. A schematic example of this kind of preprocessing is sketched below.
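The following sketch shows the general idea; the file names, column names, and source coordinate reference system are assumptions for illustration, not the exact ones from our pipeline:

import pandas as pd
from pyproj import Transformer

# Collapse the 15-minute traffic records into one average value per measurement point and day
raw = pd.read_csv("traffic_month.csv", sep=";", parse_dates=["fecha"])  # hypothetical file/column names
daily = (raw
         .assign(date=raw["fecha"].dt.date)
         .groupby(["id", "date"], as_index=False)["intensidad"]
         .mean())

# Convert station coordinates for Bokeh tile maps, which expect Web Mercator (EPSG:3857);
# we assume here that the geodata comes in UTM ETRS89 / zone 30N (EPSG:25830)
transformer = Transformer.from_crs("EPSG:25830", "EPSG:3857", always_xy=True)
stations = pd.read_csv("traffic_stations.csv", sep=";")  # hypothetical file with utm_x / utm_y columns
stations["x"], stations["y"] = transformer.transform(stations["utm_x"].values,
                                                     stations["utm_y"].values)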
We then explored the data, finding insights that gave us an idea of which analyses we wanted to carry out, along with the corresponding visualizations, to present on the webpage in an informative way.
Using the webpage format, we have been able to tell our story in a more interactive, concise, and direct way. Thanks to our visualizations, we arrived at relevant conclusions that we could not have reached without an exhaustive analysis.
What is still missing? What could be improved? Why?
We could have incorporated more social data, such as socioeconomic data, but we believe it is not relevant enough for our study, since we focus on discovering whether the environmental measure of Madrid Central reduced traffic intensity and pollution in the city center, which is already directly related to people's well-being. Perhaps we could have used data on respiratory diseases to see whether the measure improved those conditions.
We thought of using data we found about the public bicycle system established in the city, to see if there was a noticeable increase in the use of this mode of public transport, but in the end we could not include an in-depth analysis of it due to time constraints.
Related to socioeconomic data, we read a study that used sales data from shops in Madrid Central to see how they were affected by the measure. However, we did not want to investigate anything like that, as it felt outside the scope of what we wanted to show.
Regarding the analysis of the data, we could have generated more charts or maps to better understand the data, or used additional datasets, such as weather conditions, to support our conclusions. Still, we believe that the conclusions we obtain, together with the various reports and news articles we found, are enough to draw conclusions about the effectiveness of Madrid Central.
We decided not to use machine learning in our project, as we believe it would not give any insights or new ideas that cannot be shown with the data analysis we carried out; therefore, we felt it would be unnecessary to insert a machine learning model when it does not add anything to our narrative.
If we had really wanted to use a machine learning method, we could have applied a more complex model for time series analysis and prediction, but we felt that this was outside the course content, and we preferred to spend our time on tasks more closely related to what we learned in the course.
*Disclaimer: All members have collaborated to a greater or lesser extent. The table shows who has been in charge of a specific part, or who has contributed more to it.*